HTTP crawlers
The generic class AbstractHttpCrawler is the parent of BeautifulSoupCrawler, ParselCrawler, and HttpCrawler, and it can also serve as the parent of your own crawler with custom content-parsing requirements. It already includes almost all of the functionality needed to crawl web pages; the only missing pieces are the parser to be used for HTTP responses and a context dataclass that defines which context helpers will be available to user handler functions.
BeautifulSoupCrawler
BeautifulSoupCrawler uses BeautifulSoupParser to parse the HTTP response and makes the result available in BeautifulSoupCrawlingContext through the .soup or .parsed_content attribute.
ParselCrawler
ParselCrawler uses ParselParser to parse the HTTP response and makes the result available in ParselCrawlingContext through the .selector or .parsed_content attribute.
HttpCrawler
HttpCrawler uses NoParser, which does not parse the HTTP response at all; use it when no parsing is required.
Creating your own HTTP crawler
Why?
You may want to use a custom parser for HTTP responses while the rest of the AbstractHttpCrawler functionality suits your needs.
How?
You need to define at least two new classes and decide what type the parser's parse method will return.
The parser will inherit from AbstractHttpParser and will need to implement all of its abstract methods.
The crawler will inherit from AbstractHttpCrawler and will need to implement all of its abstract methods.
The newly defined parser is then passed as the parser argument to the AbstractHttpCrawler.__init__ method.
To get a better idea, see one of our own HTTP-based crawlers mentioned above as an example.