HTTP crawlers

The generic class AbstractHttpCrawler is the parent of BeautifulSoupCrawler, ParselCrawler, and HttpCrawler, and it can also serve as the parent of your own crawler when you have custom content-parsing requirements.

It already includes almost all of the functionality needed to crawl web pages; the only missing pieces are the parser used to parse HTTP responses and a context dataclass that defines which context helpers are available to user handler functions.

BeautifulSoupCrawler

BeautifulSoupCrawler uses BeautifulSoupParser to parse the HTTP response and makes the result available on BeautifulSoupCrawlingContext as the .soup or .parsed_content attribute.

ParselCrawler

ParselCrawler uses ParselParser to parse the HTTP response and makes the result available on ParselCrawlingContext as the .selector or .parsed_content attribute.

HttpCrawler

HttpCrawler uses NoParser, which does not parse the HTTP response at all; use it when no parsing is required.

Creating your own HTTP crawler

Why?

Create your own HTTP crawler when you want to use a custom parser for HTTP responses and the rest of the AbstractHttpCrawler functionality suits your needs.

How?

You need to define at least two new classes and decide what type the parser's parse method will return. The parser will inherit from AbstractHttpParser and must implement all of its abstract methods. The crawler will inherit from AbstractHttpCrawler and must implement all of its abstract methods. The newly defined parser is then passed as the parser argument of the AbstractHttpCrawler.__init__ method.
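The overall pattern can be sketched schematically in plain Python. Note that these are illustrative stand-in classes, not the real crawlee base classes; all names here are hypothetical, and the real abstract methods and signatures are defined by AbstractHttpParser and AbstractHttpCrawler:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

# The type returned by the parser's parse method (decided by you).
TParseResult = TypeVar('TParseResult')


class SketchParser(ABC, Generic[TParseResult]):
    """Stand-in for AbstractHttpParser: turns a raw response into TParseResult."""

    @abstractmethod
    def parse(self, response_body: bytes) -> TParseResult: ...


@dataclass
class WordListContext:
    """Stand-in for a context dataclass exposing parsed content to handlers."""

    parsed_content: list[str]


class WordListParser(SketchParser[list[str]]):
    """A custom parser whose parse method returns a list of words."""

    def parse(self, response_body: bytes) -> list[str]:
        return response_body.decode('utf-8').split()


class WordListCrawler:
    """Stand-in for AbstractHttpCrawler: receives the parser via __init__."""

    def __init__(self, parser: SketchParser[list[str]]) -> None:
        self._parser = parser

    def handle(self, response_body: bytes) -> WordListContext:
        # The crawler delegates parsing to the injected parser and builds
        # the context object that user handler functions would receive.
        return WordListContext(parsed_content=self._parser.parse(response_body))


crawler = WordListCrawler(parser=WordListParser())
ctx = crawler.handle(b'hello crawler world')
```

The key design point mirrored here is dependency injection: the crawler stays generic, and the parser alone decides the shape of the parsed content.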

For a better idea and concrete examples, see the source of our own HTTP-based crawlers mentioned above.