
HTTP crawlers

HTTP crawlers are ideal for extracting data from server-rendered websites that don't require JavaScript execution. These crawlers make requests via HTTP clients to fetch the HTML content and then parse it using various parsing libraries. For client-side rendered content, where you need to execute JavaScript, consider using the PlaywrightCrawler instead.

Overview

All HTTP crawlers share a common architecture built around the AbstractHttpCrawler base class; the main differences lie in the parsing strategy and in the context provided to request handlers. The built-in implementations are BeautifulSoupCrawler, ParselCrawler, and HttpCrawler, and the base class can also be extended to create custom crawlers with specialized parsing requirements. All of these crawlers use HTTP clients to fetch page content and parsing libraries to extract data from the HTML. Check out the HTTP clients guide to learn about the HTTP clients used by these crawlers, how to switch between them, and how to create custom HTTP clients tailored to your specific requirements.
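
If you want to switch the underlying HTTP client, you can pass one to the crawler's constructor. The following is a minimal sketch, assuming the HttpxHttpClient class exported from crawlee.http_clients; the set of available clients depends on your Crawlee version, so see the HTTP clients guide for the current options:

from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient

# Use an explicitly configured HTTP client instead of the default one
crawler = ParselCrawler(
    http_client=HttpxHttpClient(),
    max_requests_per_crawl=10,
)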

BeautifulSoupCrawler

The BeautifulSoupCrawler uses the BeautifulSoup library for HTML parsing. It provides fault-tolerant parsing that handles malformed HTML, automatic character encoding detection, and supports CSS selectors, tag navigation, and custom search functions. Use this crawler when working with imperfect HTML structures, when you prefer BeautifulSoup's intuitive API, or when prototyping web scraping solutions.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a BeautifulSoupCrawler instance
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using BeautifulSoup
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
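
The example above only reads the <title> tag. The variation below sketches the CSS selector and tag navigation features mentioned earlier, using the standard BeautifulSoup select() and find_all() methods; the selectors themselves are illustrative assumptions, not anything specific to crawlee.dev:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # CSS selectors via BeautifulSoup's select() method
        headings = [h.get_text(strip=True) for h in context.soup.select('h1, h2')]

        # Tag navigation with find_all(); collect outgoing link URLs
        links = [a['href'] for a in context.soup.find_all('a', href=True)]

        await context.push_data({
            'url': context.request.url,
            'headings': headings,
            'link_count': len(links),
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())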

ParselCrawler

The ParselCrawler uses the Parsel library, which provides XPath 1.0 and CSS selector support built on lxml for high performance. It includes built-in regex support for pattern matching, proper XML namespace handling, and offers better performance than BeautifulSoup while maintaining a clean API. Use this crawler when you need XPath functionality, require high-performance parsing, or need to extract data using regular expressions.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a ParselCrawler instance
    crawler = ParselCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using Parsel's XPath and CSS selectors
        data = {
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
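
To illustrate the CSS selector and regex support mentioned above, here is a variation of the same example built on Parsel's css(), getall(), and re_first() methods; the selectors and the regex pattern are illustrative assumptions, not anything specific to crawlee.dev:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        # CSS selector equivalent of the XPath query used above
        title = context.selector.css('title::text').get()

        # Collect outgoing link URLs with a CSS attribute selector
        hrefs = context.selector.css('a::attr(href)').getall()

        # Built-in regex support: grab the first four-digit year on the page, if any
        year = context.selector.re_first(r'\b(20\d{2})\b')

        await context.push_data({
            'url': context.request.url,
            'title': title,
            'link_count': len(hrefs),
            'year': year,
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())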

HttpCrawler

The HttpCrawler provides direct access to HTTP response body and headers without automatic parsing, offering maximum performance with no parsing overhead. It supports any content type (JSON, XML, binary) and allows complete control over response processing, including memory-efficient handling of large responses. Use this crawler when working with non-HTML content, requiring maximum performance, implementing custom parsing logic, or needing access to raw response data.

import asyncio
import re

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create an HttpCrawler instance - no automatic parsing
    crawler = HttpCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Get the raw response content
        response_body = await context.http_response.read()
        response_text = response_body.decode('utf-8')

        # Extract title manually using regex (since we don't have a parser)
        title_match = re.search(
            r'<title[^>]*>([^<]+)</title>', response_text, re.IGNORECASE
        )
        title = title_match.group(1).strip() if title_match else None

        # Extract basic information
        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Simple link extraction for further crawling
        href_pattern = r'href=["\']([^"\']+)["\']'
        matches = re.findall(href_pattern, response_text, re.IGNORECASE)

        # Enqueue first few links found (limit to avoid too many requests)
        for href in matches[:3]:
            if href.startswith('http') and 'crawlee.dev' in href:
                await context.add_requests([href])

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
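
Because HttpCrawler does not parse the response for you, it also works well for non-HTML content. The sketch below fetches a JSON endpoint and decodes the raw body with the standard library; the URL is just a public test endpoint chosen for illustration:

import asyncio
import json

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # Read the raw response body and decode it as JSON
        body = await context.http_response.read()
        payload = json.loads(body)

        await context.push_data({
            'url': context.request.url,
            'payload': payload,
        })

    # https://httpbin.org/json returns a small JSON document
    await crawler.run(['https://httpbin.org/json'])


if __name__ == '__main__':
    asyncio.run(main())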

Creating a custom HTTP crawler

While the built-in crawlers cover most use cases, you might need a custom HTTP crawler for specialized parsing requirements. To create a custom HTTP crawler, inherit directly from AbstractHttpCrawler. This approach requires implementing:

  1. Custom parser class: Inherit from AbstractHttpParser.
  2. Custom context class: Define what data and helpers are available to handlers.
  3. Custom crawler class: Tie everything together.

This approach is recommended when you need tight integration between parsing and the crawling context, or when you're building a reusable crawler for a specific technology or format.

Conclusion

This guide provided a comprehensive overview of HTTP crawlers in Crawlee. You learned about the three main crawler types - BeautifulSoupCrawler for fault-tolerant HTML parsing, ParselCrawler for high-performance extraction with XPath and CSS selectors, and HttpCrawler for raw response processing. You also discovered how to create custom crawlers for specific use cases.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!