magpie.datamodel

Module Contents

Classes

FormatSpec

TaggedStructMeta

Metaclass that automatically tags the class being created with the name of the module it is defined in.

Base

A base class holding some common settings.

WithID

Use this base class for defining models that need to have a uuid field.

ContentInformationBase

This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.

SemanticModel

SemanticBaseInformation

SemanticTags

SemanticSummary

SemanticEmbedding

SemanticInformation

Url

A URL with optional additional information.

TimestampMixin

Twig

The Twig class is the main data item in MagPie.

Folder

A Folder has a name and a list of sub-folders and Twigs contained inside it.

DataFetcher

The DataFetcher class is the base interface for all plugins that want to fetch specific content.

Functions

Data

PathSegment

Objects deriving from our Base struct also implement path traversal using one of the following ways to go through an object:

Path

A Path consists of a list of PathSegment.

UrlInformation

This class represents the information that can be extracted from the URL alone.

ContentInformation

This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.

API

magpie.datamodel.PathSegment

None

Objects deriving from our Base struct also implement path traversal using one of the following ways to go through an object:

  • fragment is a str and starts with a leading dot ‘.’ => we return the attribute of the object whose name is the fragment with the leading dot removed

  • fragment is an int => we return obj[frag]

  • fragment is a str => we return obj[frag]

magpie.datamodel.Path

None

A Path consists of a list of PathSegment.
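The traversal rules for PathSegment and Path described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `follow_segment`, the free function `follow` and the `Bookmark` class are hypothetical names, and the real `Base.follow()` may be implemented differently.

```python
from typing import Any, Union

# Local aliases mirroring the documented names; the real definitions may differ.
PathSegment = Union[int, str]
Path = list  # a list of PathSegment


def follow_segment(obj: Any, frag: PathSegment) -> Any:
    """Apply a single path segment to obj using the three documented rules."""
    if isinstance(frag, str) and frag.startswith("."):
        # ".name" -> attribute lookup with the leading dot removed
        return getattr(obj, frag[1:])
    # int or plain str -> item lookup
    return obj[frag]


def follow(obj: Any, path: Union[list, int, str]) -> Any:
    """Follow a whole path (or a single segment) starting from obj."""
    segments = path if isinstance(path, list) else [path]
    for frag in segments:
        obj = follow_segment(obj, frag)
    return obj


class Bookmark:
    """Hypothetical stand-in object with one attribute."""

    def __init__(self, title: str) -> None:
        self.title = title


tree = {"docs": [Bookmark("msgspec"), Bookmark("magpie")]}
print(follow(tree, ["docs", 1, ".title"]))  # prints "magpie"
```

The leading dot is what distinguishes attribute access from a string key, so the same path can mix dictionary keys, list positions and attributes.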

class magpie.datamodel.FormatSpec[source]

Bases: msgspec.Struct

space: str

None

newline: str

None

start_bold: str

None

end_bold: str

None

class magpie.datamodel.TaggedStructMeta[source]

Bases: msgspec.StructMeta

Metaclass that automatically tags the class being created with the name of the module it is defined in.

This is useful for the Fetcher plugins: they need a tag in order to be embedded into the tagged union type, and this way we don’t have to specify it manually.

This is only applied to classes defined in a submodule of magpie.fetchers.
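A minimal sketch of the idea, using a plain `type` subclass instead of msgspec.StructMeta so that it runs without any dependency. `ModuleTagMeta`, `PLUGIN_PREFIX` and the `tag` attribute are hypothetical names, and taking the last component of the module name as the tag is an assumption. The real metaclass presumably passes the tag on to msgspec's struct tagging rather than storing it as a class attribute.

```python
class ModuleTagMeta(type):
    """Plain-Python stand-in for a metaclass that tags classes by module."""

    PLUGIN_PREFIX = "magpie.fetchers."

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        module = namespace.get("__module__", "")
        # Only classes defined in a submodule of magpie.fetchers get a tag.
        if module.startswith(mcls.PLUGIN_PREFIX):
            # Assumption: the tag is the last component of the module name.
            cls.tag = module.rsplit(".", 1)[-1]
        else:
            cls.tag = None
        return cls


# Create two classes as if they had been defined in different modules.
GithubInfo = ModuleTagMeta("GithubInfo", (), {"__module__": "magpie.fetchers.github"})
Other = ModuleTagMeta("Other", (), {"__module__": "myapp.models"})

print(GithubInfo.tag)  # prints "github"
print(Other.tag)       # prints "None"
```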

class magpie.datamodel.Base[source]

Bases: msgspec.Struct

A base class holding some common settings.

  • We set omit_defaults = True to omit any fields containing only their default value from the output when encoding.

  • We set forbid_unknown_fields = True to error nicely if an unknown field is present in the serialized data. This helps catch typos early.
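The effect of these two settings can be illustrated with a small stdlib emulation. This is not how msgspec implements them; `Bookmark`, `to_dict_omit_defaults` and `from_dict_strict` are hypothetical helpers written only to show the observable behaviour.

```python
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class Bookmark:
    """Hypothetical model used only for this illustration."""

    title: str
    rating: int = 0
    tags: list = field(default_factory=list)


def to_dict_omit_defaults(obj) -> dict:
    """Encode a dataclass, skipping fields that still hold their default."""
    out = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if f.default is not MISSING and value == f.default:
            continue
        if f.default_factory is not MISSING and value == f.default_factory():
            continue
        out[f.name] = value
    return out


def from_dict_strict(cls, data: dict):
    """Decode a dict, rejecting keys that are not fields of cls."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(data) - known)
    if unknown:
        raise ValueError(f"unknown field(s) for {cls.__name__}: {unknown}")
    return cls(**data)


print(to_dict_omit_defaults(Bookmark(title="magpie")))  # prints {'title': 'magpie'}

try:
    from_dict_strict(Bookmark, {"title": "magpie", "ratting": 5})  # typo in key
except ValueError as exc:
    print(exc)
```

Omitting defaults keeps serialized files small, while rejecting unknown keys turns a silently ignored typo such as `ratting` into an immediate error.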

TEXT_FORMAT

‘FormatSpec(…)’

ANSI_FORMAT

‘FormatSpec(…)’

HTML_FORMAT

‘FormatSpec(…)’

as_plain_text() str[source]

Return an unstyled text representation suitable for outputting to a file.

as_text() str[source]

Return a text representation suitable for printing in a terminal.

as_html() str[source]

Return an HTML representation.
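One way the three rendering methods could share a single code path is to parameterise it with a FormatSpec. The sketch below is a guess at that design, not the actual implementation: the dataclass is a stand-in for the msgspec struct, `render` is a hypothetical helper, and the values of `PLAIN`, `ANSI` and `HTML` are invented for illustration because the real TEXT_FORMAT, ANSI_FORMAT and HTML_FORMAT values are not shown in this reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatSpec:
    """Stand-in for magpie.datamodel.FormatSpec with the same four fields."""

    space: str
    newline: str
    start_bold: str
    end_bold: str


# Hypothetical values; the real format constants are elided in this reference.
PLAIN = FormatSpec(space=" ", newline="\n", start_bold="", end_bold="")
ANSI = FormatSpec(space=" ", newline="\n", start_bold="\x1b[1m", end_bold="\x1b[0m")
HTML = FormatSpec(space="&nbsp;", newline="<br>", start_bold="<b>", end_bold="</b>")


def render(title: str, url: str, fmt: FormatSpec) -> str:
    """Render a bold title followed by an indented URL on the next line."""
    return f"{fmt.start_bold}{title}{fmt.end_bold}{fmt.newline}{fmt.space * 2}{url}"


print(render("magpie website", "https://magpie.digitalgaia.net", PLAIN))
print(render("magpie website", "https://magpie.digitalgaia.net", HTML))
```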

to_dict()[source]

Return the object as a dictionary of builtin types.

follow(path: magpie.datamodel.Path | magpie.datamodel.PathSegment) magpie.datamodel.Base

Follow the given path or path segment starting from self and return the resulting object.

class magpie.datamodel.WithID[source]

Bases: magpie.datamodel.Base

Use this base class for defining models that need to have a uuid field.

uuid: magpie.datamodel.WithID.uuid

‘field(…)’

magpie.datamodel.UrlInformation

None

This class represents the information that can be extracted from the URL alone.

It is initially defined as None and is rebuilt dynamically, when the fetcher plugins are loaded, as the union of all of the fetchers’ Info classes. This allows us to fully import the datamodel module and make it available to plugins.

This is fine as annotations are evaluated lazily and will only be required to be correct when validating structs, and not when importing modules.
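The placeholder-then-rebuild pattern described above can be sketched as follows. `GithubInfo`, `HackerNewsInfo` and `build_union` are hypothetical names; how MagPie actually collects the plugin classes and rebinds the name is not shown in this reference.

```python
from typing import Union, get_args

# Placeholder that exists at import time, as described above.
UrlInformation = None


class GithubInfo:
    """Hypothetical Info class contributed by a GitHub fetcher plugin."""


class HackerNewsInfo:
    """Hypothetical Info class contributed by a HackerNews fetcher plugin."""


def build_union(classes):
    """Return the union of the given classes (a single class is returned as-is)."""
    # Subscripting Union with a tuple is the same as listing the members.
    return Union[tuple(classes)]


# Once every plugin has been imported, replace the placeholder.
UrlInformation = build_union([GithubInfo, HackerNewsInfo])

print(get_args(UrlInformation))
```

Note that `Union` cannot be built from an empty list, so a loader following this approach would need to keep the placeholder when no plugin is found.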

class magpie.datamodel.ContentInformationBase[source]

Bases: magpie.datamodel.Base

This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.

It contains at least the data field, which is the content pointed at by this URL (HTML code most of the time, but it could be binary, e.g. for PDFs, images, etc.).

data: str | None

None

magpie.datamodel.ContentInformation

None

This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.

It is initially defined as ContentInformationBase and is rebuilt dynamically, when the fetcher plugins are loaded, as the union of all of the fetchers’ Content classes. This allows us to fully import the datamodel module and make it available to plugins.

This is fine as annotations are evaluated lazily and will only be required to be correct when validating structs, and not when importing modules.

class magpie.datamodel.SemanticModel[source]

Bases: magpie.datamodel.Base

model: str

None

prompt: str | None

None

settings: dict | None

None

class magpie.datamodel.SemanticBaseInformation[source]

Bases: msgspec.Struct

model: magpie.datamodel.SemanticModel

None

class magpie.datamodel.SemanticTags[source]

Bases: magpie.datamodel.SemanticBaseInformation

content: list[str]

None

class magpie.datamodel.SemanticSummary[source]

Bases: magpie.datamodel.SemanticBaseInformation

content: str

None

class magpie.datamodel.SemanticEmbedding[source]

Bases: magpie.datamodel.SemanticBaseInformation

content: list[float]

None

class magpie.datamodel.SemanticInformation[source]

Bases: msgspec.Struct

tags: magpie.datamodel.SemanticTags | None

None

summary: magpie.datamodel.SemanticSummary | None

None

embedding: magpie.datamodel.SemanticEmbedding | None

None

class magpie.datamodel.Url[source]

Bases: magpie.datamodel.WithID

A URL with optional additional information.

value: str

None

url_type: str | None

None

info: magpie.datamodel.UrlInformation | None

None

content: magpie.datamodel.ContentInformation | None

None

semantic: magpie.datamodel.SemanticInformation | None

None

class magpie.datamodel.TimestampMixin[source]

Bases: magpie.datamodel.Base

created_at: datetime.datetime

‘field(…)’

updated_at: datetime.datetime

‘field(…)’

class magpie.datamodel.Twig[source]

Bases: magpie.datamodel.WithID

The Twig class is the main data item in MagPie.

It can be thought of as an augmented bookmark. Instead of pointing to a single URL, it stands more generally for an item of interest, usually a web page but possibly more, that we want to remember.

For example, a web page we want to bookmark could be associated with others that are semantically related, such as HackerNews/Reddit discussions, GitHub readme page for a software project, etc.

title: str

None

url: magpie.datamodel.Url

None

related: list[magpie.datamodel.Url]

[]

rating: int

0

tags: list[str]

[]

notes: str | None

None

class magpie.datamodel.Folder[source]

Bases: magpie.datamodel.WithID

A Folder has a name and a list of sub-folders and Twigs contained inside it.

It is iterable and indexable, both by item position and by object name, i.e.:

folder = Folder()  # add some more elements into it...
folder.add(Twig(title='magpie website', url=Url('https://magpie.digitalgaia.net')))
first = folder[0]  # this works
magpie = folder['magpie website']  # this works too

name: str

None

items: list[magpie.datamodel.Folder | magpie.datamodel.Twig]

[]

add(item: magpie.datamodel.Folder | magpie.datamodel.Twig)[source]

remove(index: int | str)[source]

static from_urls(urls: list[str]) magpie.datamodel.Folder[source]

Build a new Folder from the given list of URL strings.

find(uuid: magpie.datamodel.Folder.find.uuid) magpie.datamodel.WithID

Return the object reachable from this folder with the given UUID.

Raises:

ValueError – if no such object could be found

find_path(obj: magpie.datamodel.WithID | None = None, uuid: magpie.datamodel.Folder.find_path.uuid | None = None, as_str: bool = False) magpie.datamodel.Path

Return a path to the given object, or the object identified by its UUID. You need to specify either obj or uuid but not both.

Parameters:

as_str – if true, return the PathSegments as str (name/title) when applicable, or as integers otherwise

Raises:

  • ValueError – if both or neither of obj and uuid are specified

  • ValueError – if no such object could be found
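The contract of find_path() can be illustrated with a small recursive search over stand-in classes. Everything here is an assumption made for the sketch: `MiniTwig`, `MiniFolder`, `_label`, `_search` and the free function `find_path` are hypothetical, and the real method may traverse the tree or build its path differently.

```python
import uuid as uuidlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniTwig:
    """Hypothetical stand-in for Twig: a title plus a UUID."""

    title: str
    uuid: uuidlib.UUID = field(default_factory=uuidlib.uuid4)


@dataclass
class MiniFolder:
    """Hypothetical stand-in for Folder: a name, child items and a UUID."""

    name: str
    items: list = field(default_factory=list)
    uuid: uuidlib.UUID = field(default_factory=uuidlib.uuid4)


def _label(item) -> str:
    """Name of a folder or title of a twig."""
    return item.name if isinstance(item, MiniFolder) else item.title


def _search(folder, target, as_str) -> Optional[list]:
    """Depth-first search; returns the segments leading to target, or None."""
    for index, item in enumerate(folder.items):
        segment = _label(item) if as_str else index
        if item.uuid == target:
            return [segment]
        if isinstance(item, MiniFolder):
            rest = _search(item, target, as_str)
            if rest is not None:
                return [segment] + rest
    return None


def find_path(root, obj=None, uuid=None, as_str=False) -> list:
    """Return the path from root to obj, or to the object with the given uuid."""
    if (obj is None) == (uuid is None):
        raise ValueError("specify exactly one of obj and uuid")
    target = obj.uuid if obj is not None else uuid
    if root.uuid == target:
        return []  # the root is reached by the empty path
    path = _search(root, target, as_str)
    if path is None:
        raise ValueError(f"no object with UUID {target} in this folder")
    return path


twig = MiniTwig("magpie website")
root = MiniFolder("root", [MiniFolder("empty"), MiniFolder("tools", [twig])])

print(find_path(root, obj=twig))               # prints [1, 0]
print(find_path(root, obj=twig, as_str=True))  # prints ['tools', 'magpie website']
```

Integer segments are unambiguous, whereas name-based segments are easier to read but depend on names being unique within a folder.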

iter_twigs(*, depth_first=True)

iter_urls(*, depth_first=True)

iter_tree(*, depth_first=True, depth=0)

iter_with_id(path=None)

magpie.datamodel.match_name(obj: magpie.datamodel.Twig | magpie.datamodel.Folder, name: str) bool[source]

class magpie.datamodel.DataFetcher[source]

The DataFetcher class is the base interface for all plugins that want to fetch specific content.

Plugins must implement at least the match() method to declare whether they can handle a given URL.

classmethod name()[source]

Return an identifying name for this DataFetcher. It is extracted as the last part of the module name in which this class is defined.

abstractmethod match(url: magpie.datamodel.Url) bool[source]

Return whether this DataFetcher is able to extract more information from the given URL.

extract_info(url: magpie.datamodel.Url) magpie.datamodel.UrlInformation[source]

Extract semantic information purely from the URL, without having to download or analyze any more resources. This is typically done using regex matching and should be cheap to compute. You can assume that self.match(url) is True if this method is called.

Examples:

  • a GitHub data fetcher would return (org_name, repo_name),

  • a HackerNews data fetcher would return the discussion ID.
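The GitHub example above could look roughly like the sketch below. It uses stand-in classes so that it runs on its own: `Url` here only has the value field, `Fetcher` is a simplified version of the DataFetcher interface, and `GithubFetcher` with its regular expression is hypothetical rather than taken from MagPie's actual plugin.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Url:
    """Minimal stand-in for magpie.datamodel.Url (only the value field)."""

    value: str


class Fetcher(ABC):
    """Simplified stand-in for DataFetcher: match() is the only required method."""

    @classmethod
    def name(cls) -> str:
        # Last part of the module name in which the class is defined.
        return cls.__module__.rsplit(".", 1)[-1]

    @abstractmethod
    def match(self, url: Url) -> bool:
        """Return whether this fetcher can handle the given URL."""

    def extract_info(self, url: Url):
        """Default: no extra information can be extracted."""
        return None


class GithubFetcher(Fetcher):
    """Hypothetical plugin; the pattern below is illustrative only."""

    PATTERN = re.compile(r"^https?://github\.com/(?P<org>[^/]+)/(?P<repo>[^/#?]+)")

    def match(self, url: Url) -> bool:
        return self.PATTERN.match(url.value) is not None

    def extract_info(self, url: Url) -> tuple:
        # Callers are expected to have checked match(url) first.
        m = self.PATTERN.match(url.value)
        return (m.group("org"), m.group("repo"))


fetcher = GithubFetcher()
url = Url("https://github.com/example-org/example-repo/issues/1")
if fetcher.match(url):
    print(fetcher.extract_info(url))  # prints ('example-org', 'example-repo')
```

Because extract_info() only runs a regular expression on the URL string, it stays cheap enough to call on every bookmark without any network access.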

fetch_additional_info(url: magpie.datamodel.Url) magpie.datamodel.ContentInformation[source]

Extract more information from the given URL, usually by downloading and parsing some additional content or related pages.

Examples:

  • a GitHub data fetcher would download the Readme file of a project

  • a HackerNews fetcher would download comments from a discussion and store them as a tree of comments, which are tuples of (author, date, comment)

extract_semantic_info(url: magpie.datamodel.Url) magpie.datamodel.SemanticInformation[source]

Use all the information we have about this URL (extracted using the other methods) and feed that to an LLM to extract semantic information.

This includes: summary, tags, embedding, etc.

expand_data(folder: magpie.datamodel.Folder)[source]

Try to expand the database by passing the whole folder to each registered fetcher, so that each one can expand it if it is able to.

This allows the fetchers to not only find more information for a single URL, but to manipulate the whole database in order to reorganize it in case they need to.

It is a good idea to call DataRetriever.identify(folder) and DataRetriever.fetch(folder) after this in order to get content for additional twigs that might have been created.