magpie.fetch.retriever
Module Contents
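Classes
DataRetriever | The DataRetriever is the main class that is used to fetch additional data/content for a given twig or folder of twigs.
Functions
embed_url_sync | Computes the URL embedding synchronously.
fetch_additional_info |
embed_url_batch_sync | Computes embeddings for a batch of URLs synchronously.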
API
- magpie.fetch.retriever.fetch_additional_info(url: magpie.datamodel.Url)
- magpie.fetch.retriever.embed_url_sync(url: magpie.datamodel.Url) -> tuple[magpie.datamodel.SemanticModel, list]
Computes the URL embedding synchronously.
- magpie.fetch.retriever.embed_url_batch_sync(urls: list[magpie.datamodel.Url])
Computes embeddings for a batch of URLs synchronously.
- class magpie.fetch.retriever.DataRetriever
The DataRetriever is the main class used to fetch additional data/content for a given twig or folder of twigs.
It contains 3 main methods, which gather the 3 types of information a Url can have:
  - identify(): parses the URL and identifies a fetcher for it, and extracts whatever information it can from the URL alone. This yields a subclass of UrlInformation. This operation is synchronous and runs on the main process, as it should be fast.
  - fetch(): fetches additional content (web page download, additional resources) for a given URL whose type has been previously identified. This yields a subclass of ContentInformation. This is asynchronous and sent to the task queue.
  - get_semantic_info(): runs LLMs or other AI models to extract semantic information. This yields a SemanticInformation instance. This is asynchronous and sent to the task queue.
For all asynchronous tasks, you can still wait for their completion by calling the wait_for_tasks_completion() method.
- get_fetcher_for(url: magpie.datamodel.Url) -> magpie.datamodel.DataFetcher | None
- identify_url(url: magpie.datamodel.Url)
Tries all the registered fetchers to see whether they match the given URL. If they do, the extracted info is added to the URL.
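The matching step described above can be sketched in isolation. This is a minimal illustration only: SketchUrl, GithubRepoFetcher, ArxivFetcher and the match() method are hypothetical stand-ins invented for this example, not magpie's actual classes or fetcher interface, and whether the real implementation stops at the first matching fetcher is not stated here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchUrl:
    """Hypothetical stand-in for a URL record (not magpie.datamodel.Url)."""
    value: str
    info: dict = field(default_factory=dict)


class GithubRepoFetcher:
    """Hypothetical fetcher that recognises GitHub repository URLs."""
    PREFIX = "https://github.com/"

    def match(self, url: SketchUrl) -> Optional[dict]:
        if not url.value.startswith(self.PREFIX):
            return None
        parts = url.value[len(self.PREFIX):].strip("/").split("/")
        if len(parts) < 2:
            return None
        # Only information available from the URL string itself is extracted.
        return {"kind": "github_repo", "owner": parts[0], "repo": parts[1]}


class ArxivFetcher:
    """Hypothetical fetcher that recognises arXiv abstract URLs."""
    PREFIX = "https://arxiv.org/abs/"

    def match(self, url: SketchUrl) -> Optional[dict]:
        if not url.value.startswith(self.PREFIX):
            return None
        paper_id = url.value[len(self.PREFIX):]
        return {"kind": "arxiv_paper", "paper_id": paper_id} if paper_id else None


def identify_url(url: SketchUrl, fetchers: list) -> SketchUrl:
    """Try every registered fetcher; each match adds its info to the URL."""
    for fetcher in fetchers:
        extracted = fetcher.match(url)
        if extracted is not None:
            url.info.update(extracted)
    return url


FETCHERS = [GithubRepoFetcher(), ArxivFetcher()]
repo = identify_url(SketchUrl("https://github.com/python/cpython"), FETCHERS)
unknown = identify_url(SketchUrl("https://example.com/page"), FETCHERS)
```

A URL that no fetcher recognises simply keeps an empty info dictionary in this sketch.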
- identify(folder: magpie.datamodel.Folder)
Takes a folder as input and tries to identify the types of URLs in all the Twigs in that folder (and subfolders).
- fetch_url(url: magpie.datamodel.Url, callback=None)
Tries all the registered fetchers to see whether they match the given URL. If they do, fetches additional info about the URL and attaches it to the URL.
- fetch(folder: magpie.datamodel.Folder, callback=None)
Takes a folder as input and fetches the content of the URLs for all the Twigs in that folder (and subfolders).
If provided, the callback must have the following signature:
def callback(int, int, str)
where the args are respectively: (completed, total, msg)
Note: this must be done after identify() has been called. Based on the identified URLs, the appropriate fetcher plugin will be used.
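A callback matching the (completed, total, msg) signature described above might look like the following sketch. The make_progress_callback helper and the loop that drives it are assumptions made for illustration; the loop only simulates progress reports and does not call magpie.

```python
from typing import Callable

ProgressCallback = Callable[[int, int, str], None]


def make_progress_callback(log: list) -> ProgressCallback:
    """Build a callback that appends one formatted progress line per call."""

    def callback(completed: int, total: int, msg: str) -> None:
        # Guard against total == 0 so an empty folder does not divide by zero.
        percent = 100 * completed // total if total else 100
        log.append(f"[{completed}/{total}] {percent}% {msg}")

    return callback


# Simulated driver: stands in for whatever reports progress during a fetch.
progress_log: list = []
on_progress = make_progress_callback(progress_log)
items = ["a", "b", "c", "d"]
for index, item in enumerate(items, start=1):
    on_progress(index, len(items), f"fetched {item}")
```

Collecting lines in a list rather than printing keeps the callback free of side effects that are hard to inspect; a real callback could just as well update a progress bar.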
- expand_data(folder: magpie.datamodel.Folder)
Takes a folder as input and passes it to all registered Fetchers so they can call their own DataFetcher.expand_data(folder) method on it.
This allows the fetchers not only to find more information for a single URL, but also to manipulate the whole database in order to reorganize it if they need to.
It is a good idea to call DataRetriever.identify(folder) and DataRetriever.fetch(folder) after this in order to get content for additional twigs that might have been created.
- get_semantic_info(folder: magpie.datamodel.Folder, callback=None)
Computes URL embeddings based on downloaded content for all URLs in the folder.
If provided, the callback must have the following signature:
def callback(int, int, str)
where the args are respectively: (completed, total, msg)
Note: this must be done after fetch() has been called.
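The ordering notes above (identify() before fetch(), fetch() before get_semantic_info()) suggest a call sequence along the following lines. This is a sketch under stated assumptions: run_pipeline is a hypothetical helper, wait_for_tasks_completion() is assumed to take no arguments and to block, and waiting between the two asynchronous stages is a cautious guess rather than a documented requirement. RecordingRetriever is a stand-in that only records calls so the example runs without magpie installed.

```python
from typing import Callable, Optional

ProgressCallback = Callable[[int, int, str], None]


def run_pipeline(retriever, folder, callback: Optional[ProgressCallback] = None) -> None:
    """Call the three stages in the documented order on a retriever-like object."""
    retriever.identify(folder)                  # synchronous stage
    retriever.fetch(folder, callback=callback)  # asynchronous stage
    retriever.wait_for_tasks_completion()       # assumption: blocks until fetch tasks finish
    retriever.get_semantic_info(folder, callback=callback)
    retriever.wait_for_tasks_completion()


class RecordingRetriever:
    """Hypothetical stand-in for DataRetriever that only records method calls."""

    def __init__(self) -> None:
        self.calls: list = []

    def identify(self, folder) -> None:
        self.calls.append("identify")

    def fetch(self, folder, callback=None) -> None:
        self.calls.append("fetch")

    def get_semantic_info(self, folder, callback=None) -> None:
        self.calls.append("get_semantic_info")

    def wait_for_tasks_completion(self) -> None:
        self.calls.append("wait")


stub = RecordingRetriever()
run_pipeline(stub, folder=object())
```

With a real DataRetriever and Folder, the same helper would be called as run_pipeline(retriever, folder, callback), provided the assumptions above hold.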
- embed_url(url: magpie.datamodel.Url, callback=None)
Computes the URL embedding based on downloaded content for the given URL. Launches a Celery task and delivers the results to the callback.
- embed_url_batch(urls: list[magpie.datamodel.Url], callback=None)
Computes URL embeddings based on downloaded content for the given URLs. Launches a Celery task and delivers the results to the callback.
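The "launch a task, receive results in a callback" pattern described above can be illustrated without a task queue. The sketch below deliberately swaps Celery for a standard-library thread pool, so it shows the shape of the pattern only; toy_embed and embed_batch_async are hypothetical names, and the toy embedding is not a real model.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


def toy_embed(text: str) -> list:
    """Toy deterministic 'embedding': text length and a character checksum."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_batch_async(
    executor: ThreadPoolExecutor,
    texts: list,
    callback: Optional[Callable[[list], None]] = None,
) -> Future:
    """Submit the batch as one background job; hand the result to the callback."""
    future = executor.submit(lambda: [toy_embed(t) for t in texts])
    if callback is not None:
        # add_done_callback passes the finished Future; unwrap it for the caller.
        future.add_done_callback(lambda done: callback(done.result()))
    return future


received: list = []
# Leaving the with-block waits for the worker thread, so the callback has run.
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = embed_batch_async(pool, ["a", "bc"], callback=received.append)
```

The caller may either rely on the callback or block on the returned future, which mirrors the choice between asynchronous delivery and waiting for completion.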