magpie.fetch.retriever

Module Contents

Classes

DataRetriever

The DataRetriever is the main class that is used to fetch additional data/content for a given twig or folder of twigs.

Functions

fetch_additional_info

embed_url_sync

Computes the embedding for a single URL, synchronously.

embed_url_batch_sync

Computes the embeddings for a batch of URLs, synchronously.

API

magpie.fetch.retriever.fetch_additional_info(url: magpie.datamodel.Url)[source]
magpie.fetch.retriever.embed_url_sync(url: magpie.datamodel.Url) → tuple[magpie.datamodel.SemanticModel, list][source]

Computes the embedding for a single URL, synchronously.

magpie.fetch.retriever.embed_url_batch_sync(urls: list[magpie.datamodel.Url])[source]

Computes the embeddings for a batch of URLs, synchronously.

class magpie.fetch.retriever.DataRetriever[source]

The DataRetriever is the main class that is used to fetch additional data/content for a given twig or folder of twigs.

It contains 3 main methods, which will gather the 3 types of information a Url can have:

  • identify(): parses the URL, identifies a fetcher for it, and extracts whatever information can be obtained from the URL alone. This yields a subclass of UrlInformation. This operation is synchronous and runs on the main process, as it should be fast.

  • fetch(): fetches additional content (web page download, additional resources) for a given URL whose type has previously been identified. This yields a subclass of ContentInformation. This is asynchronous and sent to the task queue.

  • get_semantic_info(): runs LLMs or other AI models to extract semantic information. This yields a SemanticInformation instance. This is asynchronous and sent to the task queue.

To block until all asynchronous tasks have completed, call the wait_for_tasks_completion() method.
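The ordering of the three stages can be sketched as follows. This is a minimal illustration, not the library's own code: run_pipeline is a hypothetical helper that uses only the method names and parameters documented on this page, and RecordingRetriever is a stand-in that merely records calls so the snippet runs without magpie installed. Waiting between fetch() and get_semantic_info() is an assumption, made so that downloaded content is available before embeddings are computed.

```python
def run_pipeline(retriever, folder, callback=None, timeout=None):
    """Call the three stages in their documented order on a DataRetriever-like object."""
    retriever.identify(folder)                     # synchronous, expected to be fast
    retriever.fetch(folder, callback=callback)     # sent to the task queue
    # Assumption: wait here so fetched content exists before the semantic stage.
    retriever.wait_for_tasks_completion(timeout=timeout)
    retriever.get_semantic_info(folder, callback=callback)  # sent to the task queue
    retriever.wait_for_tasks_completion(timeout=timeout)


class RecordingRetriever:
    """Stand-in (not part of magpie) that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def identify(self, folder):
        self.calls.append("identify")

    def fetch(self, folder, callback=None):
        self.calls.append("fetch")

    def get_semantic_info(self, folder, callback=None):
        self.calls.append("get_semantic_info")

    def wait_for_tasks_completion(self, timeout=None):
        self.calls.append("wait")


stub = RecordingRetriever()
run_pipeline(stub, folder=None)
print(stub.calls)
```

With the real library, a DataRetriever instance and a Folder would be passed in place of the stand-in; how those objects are constructed is not covered in this section.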

get_fetcher_for(url: magpie.datamodel.Url) → magpie.datamodel.DataFetcher | None[source]
identify_url(url: magpie.datamodel.Url)[source]

Try each registered fetcher against the given URL. For each fetcher that matches, add the information it extracts to the URL.
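The matching step described above can be illustrated with a small dispatch loop. Everything in this sketch is hypothetical: the Url stand-in and its fields, the two fetcher classes, and their match()/identify() methods are invented for illustration and are not the actual DataFetcher interface, which this page does not describe.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Url:
    """Stand-in for magpie.datamodel.Url (hypothetical fields)."""
    value: str
    info: dict = field(default_factory=dict)


def _path_parts(url):
    # Non-empty path segments, e.g. "/a/b/" -> ["a", "b"]
    return [p for p in urlparse(url.value).path.split("/") if p]


class GithubFetcher:
    """Hypothetical fetcher that recognizes github.com repository URLs."""

    def match(self, url):
        return urlparse(url.value).netloc == "github.com" and len(_path_parts(url)) >= 2

    def identify(self, url):
        owner, repo = _path_parts(url)[:2]
        return {"kind": "github_repo", "owner": owner, "repo": repo}


class PdfFetcher:
    """Hypothetical fetcher that recognizes links ending in .pdf."""

    def match(self, url):
        return urlparse(url.value).path.lower().endswith(".pdf")

    def identify(self, url):
        return {"kind": "pdf"}


FETCHERS = [GithubFetcher(), PdfFetcher()]


def identify_url(url, fetchers=FETCHERS):
    """Try every fetcher; each one that matches adds its extracted info to the URL."""
    for fetcher in fetchers:
        if fetcher.match(url):
            url.info.update(fetcher.identify(url))
    return url


repo_url = identify_url(Url("https://github.com/example/project"))
other_url = identify_url(Url("https://example.com/"))
print(repo_url.info)
print(other_url.info)
```

A URL that no fetcher recognizes is simply left without extra information in this sketch; whether the real method behaves the same way is not stated here.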

identify(folder: magpie.datamodel.Folder)[source]

Take a folder as input and try to identify the types of URLs in all the Twigs in that folder (and subfolders).

fetch_url(url: magpie.datamodel.Url, callback=None)[source]

Try each registered fetcher against the given URL. For each fetcher that matches, fetch additional information about the URL and attach it to the URL.

fetch(folder: magpie.datamodel.Folder, callback=None)[source]

Take a folder as input and fetch the content of the URLs for all the Twigs in that folder (and subfolders).

If provided, the callback must have the following signature:

def callback(int, int, str) where the args are respectively: (completed, total, msg)

Note: fetch() must be called after identify(). The appropriate fetcher plugin is chosen based on the URL types that were identified.
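A callback with the (completed, total, msg) signature described above might look like the following sketch. The helper name make_progress_callback and the log format are illustrative choices, not part of magpie, and the loop at the end only simulates the progress reports that fetch() would send.

```python
def make_progress_callback(log):
    """Build a callback(completed, total, msg) that appends progress lines to `log`."""

    def callback(completed, total, msg):
        # Guard against total == 0 to avoid a division by zero.
        percent = 100 * completed // total if total else 100
        log.append(f"[{percent:3d}%] {completed}/{total} {msg}")

    return callback


log = []
on_progress = make_progress_callback(log)

# Simulate three progress reports, as a fetch over three URLs might produce.
for done in range(1, 4):
    on_progress(done, 3, f"fetched item {done}")

print("\n".join(log))
```

With the real library, the callback would be passed as fetch(folder, callback=on_progress).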

apply_with_callback(method, callback, urls, msg)[source]
wait_for_tasks_completion(timeout=None)[source]
expand_data(folder: magpie.datamodel.Folder)[source]

Take a folder as input and pass it to all registered Fetchers so they can call their own DataFetcher.expand_data(folder) method on it.

This allows the fetchers not only to find more information for a single URL, but also to reorganize the whole database if they need to.

It is a good idea to call DataRetriever.identify(folder) and DataRetriever.fetch(folder) after this in order to get content for additional twigs that might have been created.

get_semantic_info(folder: magpie.datamodel.Folder, callback=None)[source]

Computes the URL embeddings, based on the downloaded content, for all URLs in the folder.

If provided, the callback must have the following signature:

def callback(int, int, str) where the args are respectively: (completed, total, msg)

Note: get_semantic_info() must be called after fetch().

embed_url(url: magpie.datamodel.Url, callback=None)[source]

Computes the URL embedding, based on the downloaded content, for the given URL. Launches a Celery task and delivers the result to the callback.

embed_url_batch(urls: list[magpie.datamodel.Url], callback=None)[source]

Computes the URL embeddings, based on the downloaded content, for the given batch of URLs. Launches a Celery task and delivers the results to the callback.