magpie.fetchers.generic

Module Contents

Classes

Info

Content

Content for typical webpages or blogs that are not specialized. The main fields extracted by trafilatura (https://github.com/adbar/trafilatura) are reproduced here.

Fetcher

API

class magpie.fetchers.generic.Info

Bases: magpie.datamodel.Base

class magpie.fetchers.generic.Content

Bases: magpie.datamodel.ContentInformationBase

Content for typical webpages or blogs that are not specialized. The main fields extracted by trafilatura (https://github.com/adbar/trafilatura) are reproduced here.

data: str

None

title: str | None

None

description: str | None

None

date: str | None

None

snapshot()

class magpie.fetchers.generic.Fetcher

Bases: magpie.datamodel.DataFetcher

match(url: magpie.datamodel.Url) → bool

clean_page(content: str) → magpie.fetchers.generic.Content

Generic webpage data extraction using trafilatura's default extraction settings.
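The following is a minimal sketch of what such an extraction step might look like. It is not magpie's actual implementation. `PageContent` and `clean_page_sketch` are hypothetical names introduced for illustration; `PageContent` only mirrors the four fields documented for `Content` and does not inherit from any magpie base class. The sketch assumes that `trafilatura.bare_extraction` is called with its defaults and that its result exposes `text`, `title`, `description` and `date`.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class PageContent:
    """Hypothetical stand-in mirroring the documented fields of Content."""

    data: str
    title: Optional[str] = None
    description: Optional[str] = None
    date: Optional[str] = None


def _trafilatura_extractor(html: str) -> Mapping[str, object]:
    """Run trafilatura with default settings and return a plain mapping."""
    # Imported lazily so the sketch can be loaded without trafilatura installed.
    import trafilatura

    result = trafilatura.bare_extraction(html)
    if result is None:
        # trafilatura found nothing it considers main content.
        return {}
    # Assumption: newer releases return an object with as_dict(),
    # older releases return a dict directly.
    if hasattr(result, "as_dict"):
        result = result.as_dict()
    return result


def clean_page_sketch(
    html: str,
    extractor: Callable[[str], Mapping[str, object]] = _trafilatura_extractor,
) -> PageContent:
    """Map the extractor's output onto the four documented fields."""
    fields = extractor(html)
    return PageContent(
        data=str(fields.get("text") or ""),  # main text, empty if none found
        title=fields.get("title"),
        description=fields.get("description"),
        date=fields.get("date"),
    )
```

Passing the extractor as a parameter is a choice made for this sketch only: it lets the field mapping be exercised with a stub, without network access or trafilatura itself.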

fetch_additional_info(url: magpie.datamodel.Url) → magpie.fetchers.generic.Content

expand_data(folder: magpie.datamodel.Folder)