High-level overview

The Magpie application is composed of the following components:

The GUI application

This is the interface shown to the user. It displays our list of bookmarks (twigs) and lets us launch specific actions, such as adding bookmarks, updating the collection, etc.

The application manages a database of all the Twigs that we have in our collection.

Some tasks will be run directly by the app itself, but resource-intensive ones will be delegated to helper processes that run them asynchronously in the background. This keeps the application responsive at all times.

The database

The database is composed of a main root folder that contains our collection of twigs and subfolders of twigs. A Twig is like a bookmark to a URL, but it also has related pages and other content-specific data, such as the org and repo name for a git repo link, or the comments and a link to the discussed article for an HN discussion.

See also the data model reference.

UUID

Each Folder, Twig and Url is uniquely identified by a UUID associated with it.

Callbacks can use this id to know which node they need to update in the database, e.g. in DataRetriever.fetch_url().

We might want to find a way to add data to the DB without a callback, maybe by passing the path where the resulting data should be stored (check https://sopherapps.github.io/pydantic-redis/ for inspiration?).

It is also used to identify objects in the semantic models.

As we would like to have UUIDs on more, potentially small, objects but do not want to bloat the database with too many UUIDs, we propose the following scheme to assign UUIDs to the member components of an object that has a UUID:

Warning

The following scheme is not yet implemented; at the moment, UUIDs are regular UUIDv4.

The UUID is composed of 96 random bits and 32 bits of indexing data, which point to member objects of the main object. The format of these bits can vary for each object individually.

Out of those 32 bits, the first 8 are an index of a property of the object, and the other 24 are used for sub-indexing, such as chunking, arrays, etc.

For Url, with the 32 index bits written in hexadecimal:

  • 01000000: url.info

  • 02000000: url.content

  • 02000001: url.content, 1st chunk

  • 02000002: url.content, 2nd chunk

  • 03000000: url.semantic
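As a rough illustration of the proposed scheme (which, as noted above, is not implemented), the sketch below packs a property index and a sub-index into a 128-bit id. It assumes the 32 index bits occupy the lowest bits of the id and ignores the version and variant bits of standard UUIDs; the function names are hypothetical and not part of Magpie.

```python
import secrets
import uuid
from typing import Tuple

PROPERTY_BITS = 8     # index of a property of the object
SUB_INDEX_BITS = 24   # sub-index (chunk number, array position, ...)
INDEX_BITS = PROPERTY_BITS + SUB_INDEX_BITS  # 32 bits of indexing data


def new_base_id() -> uuid.UUID:
    """Return an id with 96 random high bits and a zeroed 32-bit index field.

    Assumption: the index field sits in the lowest 32 bits.
    """
    return uuid.UUID(int=secrets.randbits(96) << INDEX_BITS)


def member_id(base: uuid.UUID, prop: int, sub: int = 0) -> uuid.UUID:
    """Derive the id of a member object from the id of its owner."""
    if not 0 <= prop < (1 << PROPERTY_BITS):
        raise ValueError("property index must fit in 8 bits")
    if not 0 <= sub < (1 << SUB_INDEX_BITS):
        raise ValueError("sub-index must fit in 24 bits")
    # Clear whatever index bits the base carries, then set the new ones.
    prefix = (base.int >> INDEX_BITS) << INDEX_BITS
    return uuid.UUID(int=prefix | (prop << SUB_INDEX_BITS) | sub)


def split_member_id(member: uuid.UUID) -> Tuple[uuid.UUID, int, int]:
    """Return (owner id, property index, sub-index) for a member id."""
    index = member.int & ((1 << INDEX_BITS) - 1)
    owner = uuid.UUID(int=member.int - index)
    return owner, index >> SUB_INDEX_BITS, index & ((1 << SUB_INDEX_BITS) - 1)
```

With this layout, the second chunk of url.content would be obtained with member_id(url_id, 2, 2), and the last 8 hexadecimal digits of the resulting id match the values listed above.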

The Task system

Some time-consuming tasks will not be performed by the main Magpie GUI app but will be delegated to helper worker processes. When those workers finish their work, the main app needs to be notified so that it can fetch the results from the queue system and update the database with them.

We are using the Celery task queue system.

The main user of the task system is the DataRetriever.

The DataRetriever can perform multiple tasks on the database through the use of Fetcher plugins, like:

  • identify a URL and find an associated fetcher that is able to retrieve more specific information

  • extract information purely from the URL

  • extract additional content for specific content types

  • extract semantic information for specific URLs

  • expand the database by creating new twigs or adding related URLs to pre-existing twigs

Fetcher plugins

The identification of the specific link type and the retrieval of specific attributes are done through a plugin system.

Each plugin declares whether it can handle a specific URL (through its match() method) and, if so, implements as many content extraction methods as possible. It should also define an Info class and/or a Content class that it returns as the result of calling its extract_info() and fetch_additional_info() methods.

Each fetcher plugin needs to implement the following interface: DataFetcher.

These plugins implement each method synchronously (i.e. as simply as possible; they have no knowledge of the task queue system). It is the DataRetriever that sends tasks wrapping calls to the plugins to the task queue system.
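To make the plugin contract more concrete, here is a minimal sketch of what a fetcher could look like. Only the method names match(), extract_info() and fetch_additional_info() come from the description above; the signatures, the GitRepoInfo and GitRepoFetcher classes, the find_fetcher() helper and the choice of github.com as the matched host are assumptions made for illustration, not the actual DataFetcher interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List, Optional
from urllib.parse import urlparse


class DataFetcher(ABC):
    """Hypothetical shape of the plugin interface (signatures are assumed)."""

    @abstractmethod
    def match(self, url: str) -> bool:
        """Return True if this plugin can handle the given URL."""

    def extract_info(self, url: str) -> Optional[object]:
        """Extract information purely from the URL; None if unsupported."""
        return None

    def fetch_additional_info(self, url: str) -> Optional[object]:
        """Retrieve additional content; None if unsupported."""
        return None


@dataclass
class GitRepoInfo:
    """Info class returned by the example plugin below."""
    org: str
    repo: str


def _path_parts(url: str) -> List[str]:
    # Split the URL path into its non-empty segments.
    return [part for part in urlparse(url).path.split("/") if part]


class GitRepoFetcher(DataFetcher):
    """Example plugin: plain synchronous code, unaware of any task queue."""

    def match(self, url: str) -> bool:
        # Assumption: a repo link is a github.com URL with at least
        # two path segments (org and repo).
        return urlparse(url).netloc == "github.com" and len(_path_parts(url)) >= 2

    def extract_info(self, url: str) -> GitRepoInfo:
        org, repo = _path_parts(url)[:2]
        return GitRepoInfo(org=org, repo=repo)


def find_fetcher(url: str, fetchers: Iterable[DataFetcher]) -> Optional[DataFetcher]:
    """Return the first plugin whose match() accepts the URL, if any."""
    for fetcher in fetchers:
        if fetcher.match(url):
            return fetcher
    return None
```

Keeping the plugins synchronous like this means a component such as the DataRetriever can wrap any of these calls in a task without the plugin needing to change.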

Task execution monitoring

After submitting a task to the task queue system, our job might not necessarily be done. In particular, we will often want to get back the result of the computation and store it in the database, so we need a way to be notified back in the app when one of the workers finishes executing a task.

This is handled by the task monitor system, which allows us to submit tasks to the queue while registering a callback to be called upon completion of said task.
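The submit-with-callback pattern can be sketched as follows. To stay self-contained, this example replaces Celery with a thread pool from the standard library, so it only illustrates the idea and is not how the Magpie task monitor is implemented; the TaskMonitor class and its method names are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable
import uuid


class TaskMonitor:
    """Toy stand-in for the task monitor, backed by a thread pool."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(
        self,
        node_id: uuid.UUID,
        func: Callable[..., Any],
        *args: Any,
        on_done: Callable[[uuid.UUID, Any], None],
    ) -> Future:
        """Run func(*args) in the background, then call on_done(node_id, result).

        The node id travels with the task so the callback knows which
        database node to update.
        """
        future = self._pool.submit(func, *args)
        # The callback runs once the future completes. If the task raised,
        # result() re-raises here and on_done is not called.
        future.add_done_callback(lambda f: on_done(node_id, f.result()))
        return future

    def shutdown(self) -> None:
        """Wait for all pending tasks and their callbacks to finish."""
        self._pool.shutdown(wait=True)
```

A caller would pass the UUID of the node being processed together with a callback that writes the result into the database, which is the role the UUID plays for callbacks such as the one used by DataRetriever.fetch_url().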