magpie.datamodel¶
Module Contents¶
Classes¶
TaggedStructMeta | Metaclass that will automatically tag the class being instantiated with the name of the module it is defined in.
Base | A base class holding some common settings.
WithID | Use this base class for defining models that need to have a uuid field.
ContentInformationBase | This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.
Url | A URL with optional additional information.
Twig | The Twig class is the main data item in MagPie.
Folder | A Folder has a name and a list of sub-folders and Twig contained inside it.
DataFetcher | The DataFetcher class is the base interface for all plugins that want to fetch specific content.
Functions¶
Data¶
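PathSegment | Objects deriving from our Base struct also implement path traversal.
Path | A Path consists of a list of PathSegment.
UrlInformation | This class represents the information that can be extracted from the URL alone.
ContentInformation | This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.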
API¶
- magpie.datamodel.PathSegment¶
None
Objects deriving from our Base struct also implement path traversal using one of the following ways to go through an object:
  - fragment is a str and starts with a leading dot '.' => we return the attribute of the object whose name is the fragment with the leading dot removed
  - fragment is an int => we return obj[frag]
  - fragment is a str => we return obj[frag]
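The three rules above can be sketched as a small resolver. This is only an illustration under stated assumptions, not MagPie's implementation: the PathSegment alias, the helpers follow_segment and follow_path, and the Node class are hypothetical names introduced for the example.

```python
from typing import Any, Union

# Assumed alias for illustration; the real PathSegment may be defined differently.
PathSegment = Union[int, str]


def follow_segment(obj: Any, frag: PathSegment) -> Any:
    """Resolve a single path segment against obj (hypothetical helper)."""
    if isinstance(frag, str) and frag.startswith("."):
        # Leading dot: attribute access, with the dot removed.
        return getattr(obj, frag[1:])
    # int or plain str: item access.
    return obj[frag]


def follow_path(obj: Any, path: list) -> Any:
    """Apply each segment in turn, starting from obj (hypothetical helper)."""
    for frag in path:
        obj = follow_segment(obj, frag)
    return obj


class Node:
    """Toy object with one attribute, used only for this example."""

    def __init__(self, items):
        self.items = items


root = Node(items=[{"title": "first"}, {"title": "second"}])
result = follow_path(root, [".items", 1, "title"])
print(result)
```

The path mixes all three kinds of segment: an attribute lookup, a positional index, and a string key.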
- magpie.datamodel.Path¶
None
A Path consists of a list of PathSegment.
- class magpie.datamodel.FormatSpec[source]¶
Bases: msgspec.Struct
- space: str¶
None
- newline: str¶
None
- start_bold: str¶
None
- end_bold: str¶
None
- class magpie.datamodel.TaggedStructMeta[source]¶
Bases: msgspec.StructMeta
Metaclass that will automatically tag the class being instantiated with the name of the module it is defined in.
This is useful for the Fetcher plugins, which need a tag in order to be embedded into the tagged union type: with this metaclass they do not have to specify one manually.
This is only applied to classes defined in a submodule of magpie.fetchers.
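The idea of deriving a tag from the defining module can be sketched with a plain Python metaclass. This sketch deliberately uses type rather than msgspec.StructMeta, and the names ModuleTagMeta, FETCHERS_PREFIX, and the tag attribute are hypothetical. Using only the last component of the module name is also an assumption.

```python
# Sketch with a plain metaclass; the real class derives from msgspec.StructMeta.
FETCHERS_PREFIX = "magpie.fetchers."  # package named in the description above


class ModuleTagMeta(type):
    """Hypothetical metaclass: derive a tag from the defining module's name."""

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        module = namespace.get("__module__", "")
        if module.startswith(FETCHERS_PREFIX):
            # e.g. "magpie.fetchers.github" -> "github" (assumed convention)
            cls.tag = module.rsplit(".", 1)[-1]
        else:
            # Classes outside the fetchers package are left untagged.
            cls.tag = None
        return cls


# Simulate classes defined in different modules by setting __module__ explicitly.
GithubInfo = ModuleTagMeta("GithubInfo", (), {"__module__": "magpie.fetchers.github"})
Other = ModuleTagMeta("Other", (), {"__module__": "some.other.module"})

print(GithubInfo.tag, Other.tag)
```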
- class magpie.datamodel.Base[source]¶
Bases: msgspec.Struct
A base class holding some common settings.
We set omit_defaults = True to omit any fields containing only their default value from the output when encoding.
We set forbid_unknown_fields = True to fail with a clear error if an unknown field is present in the serialized data. This helps catch typos early.
- TEXT_FORMAT¶
‘FormatSpec(…)’
- ANSI_FORMAT¶
‘FormatSpec(…)’
- HTML_FORMAT¶
‘FormatSpec(…)’
- as_plain_text() str[source]¶
Return an unstyled text representation suitable for outputting to a file.
- follow(path: magpie.datamodel.Path | magpie.datamodel.PathSegment) magpie.datamodel.Base¶
Follow the given path or path segment starting from self and return the resulting object.
- class magpie.datamodel.WithID[source]¶
Bases: magpie.datamodel.Base
Use this base class for defining models that need to have a uuid field.
- uuid: magpie.datamodel.WithID.uuid¶
‘field(…)’
- magpie.datamodel.UrlInformation¶
None
This class represents the information that can be extracted from the URL alone.
It is initially defined as None but is created dynamically when loading the fetcher plugins, as the union of all of the fetchers' Info classes. This allows us to fully import the datamodel module and make it available to plugins.
This is fine because annotations are evaluated lazily: they only need to be correct when validating structs, not when importing modules.
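The late-binding pattern described above can be sketched with the standard typing module. This is a minimal illustration, not the plugin loader itself: GithubInfo, HackerNewsInfo, and build_union are hypothetical names, and the real code works with msgspec structs rather than plain classes.

```python
from typing import Union, get_args

# Placeholder at import time, as described above.
UrlInformation = None


class GithubInfo:  # hypothetical fetcher Info class
    pass


class HackerNewsInfo:  # hypothetical fetcher Info class
    pass


def build_union(classes):
    """Return a Union of the given classes (hypothetical helper)."""
    # Subscripting Union with a tuple is the same as listing the members.
    return Union[tuple(classes)]


# Once the plugins are known, rebind the module-level name to the real union.
UrlInformation = build_union([GithubInfo, HackerNewsInfo])

members = get_args(UrlInformation)
print(members)
```

Any code that reads the name after this rebinding sees the full union, which is why the placeholder does not need to be correct at import time.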
- class magpie.datamodel.ContentInformationBase[source]¶
Bases: magpie.datamodel.Base
This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.
It contains at least the data field, which is the content pointed at by this URL (HTML code most of the time, but it could be binary, e.g. for PDFs, images, etc.).
- data: str | None¶
None
- magpie.datamodel.ContentInformation¶
None
This class represents the additional information that can be fetched for this URL, by downloading and/or parsing its contents.
It is initially defined as ContentInformationBase but is created dynamically when loading the fetcher plugins, as the union of all of the fetchers' Content classes. This allows us to fully import the datamodel module and make it available to plugins.
This is fine because annotations are evaluated lazily: they only need to be correct when validating structs, not when importing modules.
- class magpie.datamodel.SemanticModel[source]¶
Bases: magpie.datamodel.Base
- model: str¶
None
- prompt: str | None¶
None
- settings: dict | None¶
None
- class magpie.datamodel.SemanticBaseInformation[source]¶
Bases: msgspec.Struct
- model: magpie.datamodel.SemanticModel¶
None
- class magpie.datamodel.SemanticTags[source]¶
Bases: magpie.datamodel.SemanticBaseInformation
- content: list[str]¶
None
- class magpie.datamodel.SemanticSummary[source]¶
Bases: magpie.datamodel.SemanticBaseInformation
- content: str¶
None
- class magpie.datamodel.SemanticEmbedding[source]¶
Bases: magpie.datamodel.SemanticBaseInformation
- content: list[float]¶
None
- class magpie.datamodel.SemanticInformation[source]¶
Bases: msgspec.Struct
- tags: magpie.datamodel.SemanticTags | None¶
None
- summary: magpie.datamodel.SemanticSummary | None¶
None
- embedding: magpie.datamodel.SemanticEmbedding | None¶
None
- class magpie.datamodel.Url[source]¶
Bases: magpie.datamodel.WithID
A URL with optional additional information.
- value: str¶
None
- url_type: str | None¶
None
- info: magpie.datamodel.UrlInformation | None¶
None
- content: magpie.datamodel.ContentInformation | None¶
None
- semantic: magpie.datamodel.SemanticInformation | None¶
None
- class magpie.datamodel.TimestampMixin[source]¶
Bases: magpie.datamodel.Base
- created_at: datetime.datetime¶
‘field(…)’
- updated_at: datetime.datetime¶
‘field(…)’
- class magpie.datamodel.Twig[source]¶
Bases: magpie.datamodel.WithID
The Twig class is the main data item in MagPie.
It can be thought of as an augmented bookmark. Instead of pointing to a single URL, it stands more generally for an item of interest that we want to remember, usually a web page but possibly more.
For example, a web page we want to bookmark could be associated with others that are semantically related, such as HackerNews/Reddit discussions, the GitHub README page of a software project, etc.
- title: str¶
None
- url: magpie.datamodel.Url¶
None
[]
- rating: int¶
0
- tags: list[str]¶
[]
- notes: str | None¶
None
- class magpie.datamodel.Folder[source]¶
Bases: magpie.datamodel.WithID
A Folder has a name and a list of sub-folders and Twig contained inside it.
It is iterable and indexable, both by item position and by object name, i.e.:
    folder = Folder()
    # add some more elements into it...
    folder.add(Twig(title='magpie website', url=Url('https://magpie.digitalgaia.net')))
    first = folder[0]  # this works
    magpie = folder['magpie website']  # this works too
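One way a container can support both kinds of indexing is to branch on the key type in __getitem__. The sketch below uses standard dataclasses instead of msgspec structs. Item and Box are hypothetical stand-ins for Twig and Folder, and the rule of matching sub-folders on name and twigs on title is an assumption.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Item:
    """Hypothetical stand-in for Twig: only a title."""

    title: str


@dataclass
class Box:
    """Hypothetical stand-in for Folder: a name plus a list of items."""

    name: str
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)

    def __iter__(self):
        return iter(self.items)

    def __getitem__(self, key: Union[int, str]):
        if isinstance(key, int):
            return self.items[key]  # lookup by position
        for item in self.items:
            # Sub-folders are matched on name, twigs on title (assumed rule).
            if getattr(item, "name", None) == key or getattr(item, "title", None) == key:
                return item
        raise KeyError(key)


box = Box(name="root")
box.add(Item(title="magpie website"))
box.add(Box(name="archive"))

first = box[0]
by_title = box["magpie website"]
by_name = box["archive"]
```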
- name: str¶
None
- items: list[magpie.datamodel.Folder | magpie.datamodel.Twig]¶
[]
- add(item: magpie.datamodel.Folder | magpie.datamodel.Twig)[source]¶
- static from_urls(urls: list[str]) magpie.datamodel.Folder[source]¶
Build a new Folder from the given list of URL strings.
- find(uuid: magpie.datamodel.Folder.find.uuid) magpie.datamodel.WithID¶
Return the object reachable from this folder with the given UUID.
- Raises:
ValueError – if no such object could be found
- find_path(obj: magpie.datamodel.WithID | None = None, uuid: magpie.datamodel.Folder.find_path.uuid | None = None, as_str: bool = False) magpie.datamodel.Path¶
Return a path to the given object, or to the object identified by its UUID. You need to specify either obj or uuid, but not both.
- Parameters:
as_str – if true, return the PathSegments as str (name/title) when applicable, or as integers otherwise
- Raises:
ValueError – if both or neither of obj and uuid are specified
ValueError – if no such object could be found
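The contract above (exactly one of obj or uuid, integer or string segments, ValueError when nothing matches) can be sketched as a depth-first search. This is an illustration only: Leaf and Tree are hypothetical stand-ins for Twig and Folder, and the exact segments the real method returns are an assumption.

```python
import uuid as uuidlib
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Hypothetical stand-in for Twig."""

    title: str
    uuid: uuidlib.UUID = field(default_factory=uuidlib.uuid4)


@dataclass
class Tree:
    """Hypothetical stand-in for Folder."""

    name: str
    items: list = field(default_factory=list)
    uuid: uuidlib.UUID = field(default_factory=uuidlib.uuid4)


def find_path(root, obj=None, uuid=None, as_str=False):
    """Depth-first search returning the list of segments from root to the target."""
    if (obj is None) == (uuid is None):
        raise ValueError("specify exactly one of obj or uuid")
    target = obj.uuid if obj is not None else uuid

    def search(folder, prefix):
        for index, item in enumerate(folder.items):
            label = getattr(item, "name", None) or getattr(item, "title", None)
            # Use the name/title when requested and available, else the position.
            segment = label if (as_str and label) else index
            path = prefix + [segment]
            if item.uuid == target:
                return path
            if isinstance(item, Tree):
                found = search(item, path)
                if found is not None:
                    return found
        return None

    result = search(root, [])
    if result is None:
        raise ValueError(f"no object with uuid {target}")
    return result


leaf = Leaf(title="magpie website")
root = Tree(name="root", items=[Tree(name="empty"), Tree(name="projects", items=[leaf])])

by_index = find_path(root, obj=leaf)
by_name = find_path(root, uuid=leaf.uuid, as_str=True)
```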
- iter_twigs(*, depth_first=True)¶
- iter_urls(*, depth_first=True)¶
- iter_tree(*, depth_first=True, depth=0)¶
- iter_with_id(path=None)¶
- magpie.datamodel.match_name(obj: magpie.datamodel.Twig | magpie.datamodel.Folder, name: str) bool[source]¶
- class magpie.datamodel.DataFetcher[source]¶
The DataFetcher class is the base interface for all plugins that want to fetch specific content.
They need to at least implement the match() method to declare whether they can handle a certain URL.
- classmethod name()[source]¶
Return an identifying name for this DataFetcher. It is extracted as the last part of the module name in which this class is defined.
- abstractmethod match(url: magpie.datamodel.Url) bool[source]¶
Return whether this DataFetcher is able to extract more information from the given URL.
- extract_info(url: magpie.datamodel.Url) magpie.datamodel.UrlInformation[source]¶
Extract semantic information purely from the URL, without having to download or analyze any more resources. This is typically done using regex matching and should be cheap to compute. You can assume that self.match(url) is True if this method is called.
Examples:
  - a GitHub data fetcher would return (org_name, repo_name),
  - a HackerNews data fetcher would return the discussion ID.
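The GitHub example above could look roughly like the following. This is a simplified sketch that does not use the real DataFetcher or Url classes: FetcherSketch and GithubFetcherSketch are hypothetical, URLs are plain strings, and the regular expression is an assumed pattern for repository URLs.

```python
import re
from abc import ABC, abstractmethod


class FetcherSketch(ABC):
    """Hypothetical, simplified stand-in for DataFetcher; URLs are plain strings here."""

    @classmethod
    def name(cls) -> str:
        # Last component of the name of the module defining the class.
        return cls.__module__.rsplit(".", 1)[-1]

    @abstractmethod
    def match(self, url: str) -> bool:
        ...


class GithubFetcherSketch(FetcherSketch):
    # Assumed pattern: https://github.com/<org>/<repo>[/...]
    PATTERN = re.compile(r"^https?://github\.com/([^/\s]+)/([^/\s#?]+)")

    def match(self, url: str) -> bool:
        return self.PATTERN.match(url) is not None

    def extract_info(self, url: str) -> tuple:
        # Callers are expected to have checked match() first.
        m = self.PATTERN.match(url)
        return (m.group(1), m.group(2))


fetcher = GithubFetcherSketch()
url = "https://github.com/example-org/example-repo"
info = fetcher.extract_info(url) if fetcher.match(url) else None
```

Because extract_info only runs a regular expression, it stays cheap enough to call on every URL that matches.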
- fetch_additional_info(url: magpie.datamodel.Url) magpie.datamodel.ContentInformation[source]¶
Extract more information from the given URL, usually by downloading and parsing some additional content or related pages.
Examples:
  - a GitHub data fetcher would download the README file of a project,
  - a HackerNews data fetcher would download comments from a discussion and store them as a tree of comments, which are tuples of (author, date, comment).
- extract_semantic_info(url: magpie.datamodel.Url) magpie.datamodel.SemanticInformation[source]¶
Use all the information we have about this URL (extracted using the other methods) and feed that to an LLM to extract semantic information.
This includes: summary, tags, embedding, etc.
- expand_data(folder: magpie.datamodel.Folder)[source]¶
Try to expand the database by passing the whole folder to each registered fetcher and asking them if they can do it.
This allows the fetchers to not only find more information for a single URL, but to manipulate the whole database in order to reorganize it in case they need to.
It is a good idea to call DataRetriever.identify(folder) and DataRetriever.fetch(folder) after this in order to get content for additional twigs that might have been created.