HTTP crawlers
HTTP crawlers are ideal for extracting data from server-rendered websites that don't require JavaScript execution. These crawlers make requests via HTTP clients to fetch HTML content and then parse it using various parsing libraries. For client-side rendered content, where you need to execute JavaScript, consider using PlaywrightCrawler instead.
Overview
All HTTP crawlers share a common architecture built around the AbstractHttpCrawler base class. The main differences lie in the parsing strategy and the context provided to request handlers. Crawlee ships three concrete implementations: BeautifulSoupCrawler, ParselCrawler, and HttpCrawler. AbstractHttpCrawler can also be extended to create custom crawlers with specialized parsing requirements. All of them use HTTP clients to fetch page content and parsing libraries to extract data from the HTML. Check out the HTTP clients guide to learn about the HTTP clients these crawlers use, how to switch between them, and how to create custom HTTP clients tailored to your specific requirements.
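For example, switching the HTTP client is just a constructor option. The short sketch below assumes the HttpxHttpClient implementation in crawlee.http_clients (depending on your Crawlee version it may live behind an optional extra); any client described in the HTTP clients guide can be passed the same way.

from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient

# Pass an explicit HTTP client instead of relying on the default one.
# HttpxHttpClient is assumed here; see the HTTP clients guide for the
# implementations available in your Crawlee version.
crawler = ParselCrawler(
    http_client=HttpxHttpClient(),
    max_requests_per_crawl=10,
)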
BeautifulSoupCrawler
The BeautifulSoupCrawler uses the BeautifulSoup library for HTML parsing. It offers fault-tolerant parsing that handles malformed HTML, automatic character encoding detection, and support for CSS selectors, tag navigation, and custom search functions. Use this crawler when you are working with imperfect HTML structures, prefer BeautifulSoup's intuitive API, or are prototyping a web scraping solution.
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a BeautifulSoupCrawler instance
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using BeautifulSoup
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
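The handler above only reads the page title, but context.soup is a regular BeautifulSoup object, so the rest of the library's API is available. As a hedged variation, you could replace the request handler in the example above with one that uses CSS selectors (the extracted fields here are just illustrative):

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # context.soup is a standard BeautifulSoup object, so CSS selectors work as usual.
    description_tag = context.soup.select_one('meta[name="description"]')
    data = {
        'url': context.request.url,
        # Collect the text of every <h2> heading on the page.
        'headings': [h2.get_text(strip=True) for h2 in context.soup.select('h2')],
        # Read an attribute from a single matched element, if present.
        'description': description_tag.get('content') if description_tag else None,
    }
    await context.push_data(data)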
ParselCrawler
The ParselCrawler uses the Parsel library, which provides XPath 1.0 and CSS selector support built on lxml for high performance. It includes built-in regex support for pattern matching and proper XML namespace handling, and it offers better performance than BeautifulSoup while maintaining a clean API. Use this crawler when you need XPath functionality, require high-performance parsing, or need to extract data using regular expressions.
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a ParselCrawler instance
    crawler = ParselCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using Parsel's XPath and CSS selectors
        data = {
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
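Beyond XPath, the same context.selector object supports CSS selectors and Parsel's regex helpers. As an illustrative variation on the handler above (the extracted fields are arbitrary examples), you could write:

@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    # CSS selectors and regex helpers are available on the same Parsel selector.
    data = {
        'url': context.request.url,
        # CSS selector: collect the href attribute of every link on the page.
        'links': context.selector.css('a::attr(href)').getall(),
        # Regex helper: pull the first four-digit number out of the footer, if any.
        'footer_year': context.selector.css('footer').re_first(r'\d{4}'),
    }
    await context.push_data(data)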
HttpCrawler
The HttpCrawler provides direct access to the HTTP response body and headers without automatic parsing, offering maximum performance with no parsing overhead. It supports any content type (JSON, XML, binary) and allows complete control over response processing, including memory-efficient handling of large responses. Use this crawler when you work with non-HTML content, require maximum performance, implement custom parsing logic, or need access to raw response data.
import asyncio
import re

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create an HttpCrawler instance - no automatic parsing
    crawler = HttpCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Get the raw response content
        response_body = await context.http_response.read()
        response_text = response_body.decode('utf-8')

        # Extract title manually using regex (since we don't have a parser)
        title_match = re.search(
            r'<title[^>]*>([^<]+)</title>', response_text, re.IGNORECASE
        )
        title = title_match.group(1).strip() if title_match else None

        # Extract basic information
        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Simple link extraction for further crawling
        href_pattern = r'href=["\']([^"\']+)["\']'
        matches = re.findall(href_pattern, response_text, re.IGNORECASE)

        # Enqueue first few links found (limit to avoid too many requests)
        for href in matches[:3]:
            if href.startswith('http') and 'crawlee.dev' in href:
                await context.add_requests([href])

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
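Because HttpCrawler performs no parsing, the same pattern works for non-HTML responses such as JSON APIs. A minimal sketch of a handler for a JSON endpoint, replacing the HTML-oriented handler above (the 'items' field is hypothetical), might look like this:

import json

@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    # Decode the raw body as JSON instead of treating it as HTML.
    payload = json.loads(await context.http_response.read())
    await context.push_data({
        'url': context.request.url,
        # 'items' is a hypothetical field name; adjust it to the API you call.
        'item_count': len(payload.get('items', [])),
    })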
Creating a custom HTTP crawler
While the built-in crawlers cover most use cases, you might need a custom HTTP crawler for specialized parsing requirements. To create one, inherit directly from AbstractHttpCrawler. This approach requires implementing:
- Custom parser class: Inherit from AbstractHttpParser.
- Custom context class: Define what data and helpers are available to handlers.
- Custom crawler class: Tie everything together.
This approach is recommended when you need tight integration between parsing and the crawling context, or when you're building a reusable crawler for a specific technology or format.
Conclusion
This guide provided a comprehensive overview of HTTP crawlers in Crawlee. You learned about the three main crawler types: BeautifulSoupCrawler for fault-tolerant HTML parsing, ParselCrawler for high-performance extraction with XPath and CSS selectors, and HttpCrawler for raw response processing. You also discovered how to create custom crawlers for specific use cases.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!