magpie.fetchers.generic
Module Contents
Classes
API
- class magpie.fetchers.generic.Info
Bases:
magpie.datamodel.Base
- class magpie.fetchers.generic.Content
Bases:
magpie.datamodel.ContentInformationBase
Content for typical webpages or blogs that are not specialized. We reproduce the main fields extracted by trafilatura (https://github.com/adbar/trafilatura).
- data: str
None
- title: str | None
None
- description: str | None
None
- date: str | None
None
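As a rough illustration of the shape of these fields, here is a minimal sketch using a standard-library dataclass. `PageContent` is a hypothetical stand-in, not the actual `Content` class, which derives from a `magpie.datamodel` base whose behaviour is not documented here; making `data` required and defaulting the optional fields to `None` is likewise an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PageContent:
    """Hypothetical stand-in mirroring the fields listed for Content."""

    data: str                          # main extracted text of the page
    title: Optional[str] = None        # page title, if one was found
    description: Optional[str] = None  # short summary, if one was found
    date: Optional[str] = None         # publication date as a string, if found


# A page for which only the text and the title were extracted.
page = PageContent(data="Hello, world.", title="Greeting")
print(asdict(page))
```

Fields that could not be extracted simply stay `None`, which matches the `str | None` annotations above.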
- class magpie.fetchers.generic.Fetcher
Bases:
magpie.datamodel.DataFetcher
- match(url: magpie.datamodel.Url) → bool
- clean_page(content: str) → magpie.fetchers.generic.Content
Generic webpage data extraction using trafilatura's default extraction settings.
- fetch_additional_info(url: magpie.datamodel.Url) → magpie.fetchers.generic.Content
- expand_data(folder: magpie.datamodel.Folder)
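The following sketch shows one way the extraction step described for `clean_page` might look. It is not magpie's implementation: `clean_page_sketch` and its `extract_text` parameter are hypothetical names, the result is a plain dict rather than a `Content` instance, and only the main text is filled in. The one real library call is `trafilatura.extract`, invoked with no options so that trafilatura's defaults apply.

```python
from typing import Callable, Dict, Optional


def _trafilatura_extract(html: str) -> Optional[str]:
    # Imported lazily so the rest of the sketch can be exercised
    # without trafilatura being installed.
    import trafilatura

    # With no extra arguments, extract() uses trafilatura's default
    # settings and returns the main text, or None if nothing was found.
    return trafilatura.extract(html)


def clean_page_sketch(
    html: str,
    extract_text: Callable[[str], Optional[str]] = _trafilatura_extract,
) -> Dict[str, Optional[str]]:
    """Hypothetical helper: turn raw HTML into a Content-like dict."""
    text = extract_text(html)
    return {
        "data": text or "",   # assumption: fall back to an empty string
        "title": None,        # metadata extraction is omitted in this sketch
        "description": None,
        "date": None,
    }
```

Passing the extractor as a parameter is purely a convenience for this sketch: it lets the function be tried with a stub, for example `clean_page_sketch("<p>Hi</p>", extract_text=lambda html: "Hi")`.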