`magpie.fetch.retriever`

Module Contents

Classes

DataRetriever

The DataRetriever is the main class that is used to fetch additional data/content for a given twig or folder of twigs.

Functions

`fetch_additional_info`
`embed_url_sync`	Computes the embedding of a URL, synchronously
`embed_url_batch_sync`	Computes embeddings for a batch of URLs, synchronously

API

magpie.fetch.retriever.fetch_additional_info(url: magpie.datamodel.Url)

magpie.fetch.retriever.embed_url_sync(url: magpie.datamodel.Url) → tuple[magpie.datamodel.SemanticModel, list]

Computes the embedding of a URL, synchronously.

magpie.fetch.retriever.embed_url_batch_sync(urls: list[magpie.datamodel.Url])

Computes embeddings for a batch of URLs, synchronously.

class magpie.fetch.retriever.DataRetriever

The DataRetriever is the main class that is used to fetch additional data/content for a given twig or folder of twigs.

It contains three main methods, which gather the three types of information a Url can have:

identify(): parses the URL and identifies a fetcher for it, and extracts whatever information it can from the URL alone. This yields a subclass of UrlInformation. This operation is synchronous and runs in the main process, as it should be fast.
fetch(): fetches additional content (web page download, additional resources) for a given URL whose type has previously been identified. This yields a subclass of ContentInformation. This is asynchronous and sent to the task queue.
get_semantic_info(): runs LLMs or other AI models to extract semantic information. This yields a SemanticInformation instance. This is asynchronous and sent to the task queue.

For all asynchronous tasks, you can still decide to wait on their completion by calling the wait_for_tasks_completion() method.
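The order of the three stages can be sketched as follows. This is only an illustration, not code from magpie: `run_pipeline` and `CallRecorder` are hypothetical names, the recorder is a stand-in that lets the example run without magpie installed, and waiting between `fetch()` and `get_semantic_info()` is an assumption based on the embeddings being computed from downloaded content.

```python
from typing import Any, Optional


def run_pipeline(retriever: Any, folder: Any, timeout: Optional[float] = None) -> None:
    """Hypothetical helper: call the three stages in the documented order."""
    # Stage 1 is synchronous, so no wait is needed after it.
    retriever.identify(folder)

    # Stage 2 is queued; wait so downloaded content exists before stage 3.
    # (Waiting here is an assumption, not something this page requires.)
    retriever.fetch(folder)
    retriever.wait_for_tasks_completion(timeout=timeout)

    # Stage 3 is queued as well; wait again before using the results.
    retriever.get_semantic_info(folder)
    retriever.wait_for_tasks_completion(timeout=timeout)


class CallRecorder:
    """Stand-in for DataRetriever that only records method names."""

    def __init__(self) -> None:
        self.calls = []

    def __getattr__(self, name: str):
        # Reached only for attributes not found normally, i.e. method names.
        def method(*args: Any, **kwargs: Any) -> None:
            self.calls.append(name)
        return method


recorder = CallRecorder()
run_pipeline(recorder, folder="my-folder")
print(recorder.calls)
```

With a real retriever, the same helper would presumably be called with a `DataRetriever` instance and a `magpie.datamodel.Folder`; how those objects are constructed is not covered on this page.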

get_fetcher_for(url: magpie.datamodel.Url) → magpie.datamodel.DataFetcher | None

identify_url(url: magpie.datamodel.Url)

Try all the registered fetchers to see whether they match the given URL. If they do, add the extracted information to the URL.

identify(folder: magpie.datamodel.Folder)

Take a folder as input and try to identify the types of URLs in all the Twigs in that folder (and subfolders).

fetch_url(url: magpie.datamodel.Url, callback=None)

Try all the registered fetchers to see whether they match the given URL. If they do, fetch additional information about the URL and add it to the URL.

fetch(folder: magpie.datamodel.Folder, callback=None)

Take a folder as input and fetch the content of the URLs for all the Twigs in that folder (and subfolders).

If provided, the callback must have the following signature:

def callback(int, int, str), where the arguments are, respectively, (completed, total, msg)

Note: this must be done after identify() has been called. Based on the identified URLs, the appropriate fetcher plugin will be used.
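A callback matching the `(completed, total, msg)` signature might look like the sketch below. `make_progress_callback` is a hypothetical helper, not part of magpie, and the message format is arbitrary; lines are collected in a list here so the behaviour is easy to inspect.

```python
from typing import Callable, List


def make_progress_callback(log: List[str]) -> Callable[[int, int, str], None]:
    """Build a (completed, total, msg) callback that appends progress lines to log."""

    def callback(completed: int, total: int, msg: str) -> None:
        # Guard against total == 0 to avoid a division by zero.
        percent = 100 * completed // total if total else 0
        log.append(f"[{completed}/{total}] {percent}% {msg}")

    return callback


progress_log: List[str] = []
on_progress = make_progress_callback(progress_log)
on_progress(1, 4, "fetching")
print(progress_log[0])
```

Given the signature shown above, such a function would be passed as `fetch(folder, callback=on_progress)`.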

apply_with_callback(method, callback, urls, msg)

wait_for_tasks_completion(timeout=None)

expand_data(folder: magpie.datamodel.Folder)

Take a folder as input and pass it to all registered fetchers, so that each one can run its own DataFetcher.expand_data(folder) method on it.

This lets fetchers not only find more information for a single URL, but also reorganize the whole database if they need to.

It is a good idea to call DataRetriever.identify(folder) and DataRetriever.fetch(folder) after this in order to get content for additional twigs that might have been created.
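The recommended follow-up can be sketched as below. `expand_and_refresh` and `MethodLog` are hypothetical names; the latter is only a stand-in so the example runs without magpie, and the real calls would go to a `DataRetriever` instance.

```python
from typing import Any


def expand_and_refresh(retriever: Any, folder: Any) -> None:
    """Hypothetical helper: expand the folder, then identify and fetch again."""
    # Fetchers may create or reorganize twigs here.
    retriever.expand_data(folder)
    # Pick up any twigs that expand_data() may have added.
    retriever.identify(folder)
    retriever.fetch(folder)


class MethodLog:
    """Stand-in for DataRetriever that records which methods were called."""

    def __init__(self) -> None:
        self.names = []

    def __getattr__(self, name: str):
        # Reached only for attributes not found normally, i.e. method names.
        def method(*args: Any, **kwargs: Any) -> None:
            self.names.append(name)
        return method


log = MethodLog()
expand_and_refresh(log, "my-folder")
print(log.names)
```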

get_semantic_info(folder: magpie.datamodel.Folder, callback=None)

Computes the URL embedding, based on the downloaded content, for all URLs in the folder.

If provided, the callback must have the following signature: