High-level overview¶
The Magpie application is composed of the following components:
The GUI application¶
This is the interface which we show the user. It shows our list of bookmarks (twigs) in a nice interface and allows us to launch specific actions, such as adding bookmarks, updating the collection, etc.
The application manages a database of all the Twigs that we have in our collection.
Some tasks will be run directly by the app itself, but resource consuming ones will be delegated to helper processes that will run those tasks asynchronously in the background. This keeps the application responsive at all times.
The database¶
The database is composed of the main root folder that contains our collection of twigs and subfolders of twigs. A Twig is like a bookmark to a URL, but it also holds related pages and other content-specific data, such as the org and repo name for a git repo link, or the comments and a link to the discussed article for an HN discussion.
See also the data model reference.
UUID¶
Each Folder, Twig and Url is uniquely identified by a UUID associated with it.
Callbacks can use this id to know which node they need to update in the database, e.g. in DataRetriever.fetch_url().
We might want to find a way to add data to the DB without a callback, maybe by passing the path where the resulting data should be stored (check https://sopherapps.github.io/pydantic-redis/ for inspiration?).
It is also used to identify objects in the semantic models.
We would like to have UUIDs on more, potentially small, objects without bloating the database with too many UUIDs. We therefore propose the following scheme to assign UUIDs to the member components of an object that has a UUID:
Warning
The following is not implemented yet; at the moment, UUIDs are regular UUIDv4.
The UUID is composed of 96 random bits and 32 bits of indexing data, which point to member objects of the main object. The format of these bits can vary for each object individually.
Out of those 32 bits, the first 8 are the index of a property of the object, and the other 24 are used for sub-indexing, e.g. for chunks, arrays, etc.
for Url:
01000000: url.info
02000000: url.content
02000001: url.content, 1st chunk
02000002: url.content, 2nd chunk
…
03000000: url.semantic
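Since this scheme is not implemented yet, the following is only a minimal sketch of how such ids could be built and decoded. It assumes the 32 index bits sit in the lowest bits of the 128-bit value, which the text above does not specify. The helper names (`new_base_uuid`, `member_uuid`, `split_uuid`) are hypothetical. Ids built this way do not carry the UUIDv4 version bits.

```python
import secrets
import uuid

INDEX_BITS = 32      # indexing data (assumed to be the lowest 32 bits)
PROPERTY_BITS = 8    # first 8 index bits: property of the object
SUB_BITS = 24        # remaining 24 index bits: chunk, array item, etc.

# Property indices for Url, taken from the list above.
URL_INFO, URL_CONTENT, URL_SEMANTIC = 0x01, 0x02, 0x03


def new_base_uuid() -> uuid.UUID:
    """Return an id with 96 random bits and all 32 index bits set to zero."""
    return uuid.UUID(int=secrets.randbits(96) << INDEX_BITS)


def member_uuid(base: uuid.UUID, prop: int, sub: int = 0) -> uuid.UUID:
    """Derive the id of a member (property + sub-index) of the object `base`."""
    if not 0 <= prop < (1 << PROPERTY_BITS):
        raise ValueError("property index must fit in 8 bits")
    if not 0 <= sub < (1 << SUB_BITS):
        raise ValueError("sub-index must fit in 24 bits")
    prefix = (base.int >> INDEX_BITS) << INDEX_BITS  # keep only the random part
    return uuid.UUID(int=prefix | (prop << SUB_BITS) | sub)


def split_uuid(value: uuid.UUID) -> tuple[uuid.UUID, int, int]:
    """Split an id into (base id, property index, sub-index)."""
    index = value.int & ((1 << INDEX_BITS) - 1)
    base = uuid.UUID(int=value.int - index)
    return base, index >> SUB_BITS, index & ((1 << SUB_BITS) - 1)


url_id = new_base_uuid()
first_chunk = member_uuid(url_id, URL_CONTENT, 1)
print(first_chunk.hex[-8:])  # the 32 index bits as hex
```

With this layout, only the base id needs to be stored; member ids can be derived on demand and mapped back to their parent object with `split_uuid`.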
The Task system¶
Some time-consuming tasks are not performed by the main Magpie GUI app but are delegated to helper worker processes. When those workers finish their work, the main app needs to be notified so that it can fetch the results from the queue system and update the database with them.
We are using the Celery task queue system.
The main user of the task system is the DataRetriever.
The DataRetriever can perform multiple tasks on the database through the use of Fetcher plugins, like:
identify a URL and find an associated fetcher that is able to retrieve more specific information
extract information purely from the URL
extract additional content for specific content type
extract semantic information for specific URLs
expand the database by creating new twigs or adding related URLs to pre-existing twigs
Fetcher plugins¶
The identification of the specific link type and the retrieval of specific attributes are done through a plugin system.
Each plugin declares whether it can handle a specific URL (through its
match() method) and, if so, implements as many content
extraction methods as possible. It should also define an Info class and/or a Content class,
which its extract_info() and fetch_additional_info() methods
return as their results.
Each fetcher plugin needs to implement the following interface: DataFetcher.
These plugins implement each method synchronously (i.e. as simply as possible; they have no
knowledge of the task queue system). It is the DataRetriever that wraps calls to the
plugins in tasks and sends them to the task queue system.
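To make the plugin contract more concrete, here is a minimal sketch of what a fetcher could look like. Only the names `DataFetcher`, `match()`, `extract_info()`, `fetch_additional_info()` and `Info` come from the text above. The signatures, the `GitRepoFetcher` plugin, the `find_fetcher()` helper and the choice of github.com are assumptions made for illustration, and the real interface may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional
from urllib.parse import urlparse


class DataFetcher(ABC):
    """Assumed shape of the plugin interface (signatures are hypothetical)."""

    @abstractmethod
    def match(self, url: str) -> bool:
        """Return True if this plugin can handle the given URL."""

    @abstractmethod
    def extract_info(self, url: str) -> Any:
        """Extract information purely from the URL, without network access."""

    def fetch_additional_info(self, url: str) -> Optional[Any]:
        """Optionally fetch extra content; plugins may leave this unimplemented."""
        return None


def _path_segments(url: str) -> list[str]:
    return [segment for segment in urlparse(url).path.split("/") if segment]


class GitRepoFetcher(DataFetcher):
    """Hypothetical plugin for git repo links hosted on github.com."""

    @dataclass
    class Info:
        org: str
        repo: str

    def match(self, url: str) -> bool:
        return urlparse(url).netloc == "github.com" and len(_path_segments(url)) >= 2

    def extract_info(self, url: str) -> "GitRepoFetcher.Info":
        org, repo = _path_segments(url)[:2]
        return self.Info(org=org, repo=repo)


def find_fetcher(url: str, fetchers: list[DataFetcher]) -> Optional[DataFetcher]:
    """Return the first plugin whose match() accepts the URL, if any."""
    for fetcher in fetchers:
        if fetcher.match(url):
            return fetcher
    return None


fetcher = find_fetcher("https://github.com/python/cpython", [GitRepoFetcher()])
info = fetcher.extract_info("https://github.com/python/cpython")
print(info)
```

Note that the plugin itself is plain synchronous code: nothing in it refers to the task queue, which matches the division of responsibilities described above.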
Task execution monitoring¶
After submitting a task to the task queue system, our job is not necessarily done. In particular, we will often want to get back the result of the computation and store it in the database, so we need a way to be notified back in the app when one of the workers finishes executing a task.
This is handled by the task monitor system, which allows us to submit tasks to the queue while registering a callback to be called upon completion of said task.
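The submit-with-callback pattern can be illustrated with a small, self-contained sketch. It does not use Celery: a `ThreadPoolExecutor` from the standard library stands in for the task queue so that the example runs on its own. The `TaskMonitor` class, its `submit()` signature, the `fetch_title()` task and the dict-based `database` are all hypothetical and are not Magpie's actual API.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class TaskMonitor:
    """Toy stand-in: runs tasks in a thread pool and calls back on completion."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, func: Callable[..., Any], *args: Any,
               on_done: Callable[[Any], None]) -> Future:
        future = self._pool.submit(func, *args)
        # The callback receives the task result once the worker has finished.
        future.add_done_callback(lambda done: on_done(done.result()))
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


# A dict keyed by UUID stands in for the real database.
database: dict[uuid.UUID, dict[str, Any]] = {}


def fetch_title(url: str) -> str:
    """Placeholder for a slow task; a real one would perform network I/O."""
    return f"title of {url}"


node_id = uuid.uuid4()
database[node_id] = {"url": "https://example.com"}

monitor = TaskMonitor()
# The callback closes over node_id, so it knows which node to update.
monitor.submit(fetch_title, database[node_id]["url"],
               on_done=lambda title: database[node_id].update(title=title))
monitor.shutdown()  # wait for the task and its callback to finish

print(database[node_id])
```

The callback captures the UUID of the node at submission time, which is how it can find the right entry in the database when the result arrives.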