HTTP crawlers

HTTP crawlers are ideal for extracting data from server-rendered websites that don't require JavaScript execution. These crawlers make requests via HTTP clients to fetch HTML content and then parse it using various parsing libraries. For client-side rendered content where you need to execute JavaScript, consider using the PlaywrightCrawler instead.

Overview

All HTTP crawlers share a common architecture built around the AbstractHttpCrawler base class. The main differences lie in the parsing strategy and the context provided to request handlers: Crawlee ships with BeautifulSoupCrawler, ParselCrawler, and HttpCrawler, and AbstractHttpCrawler can also be extended to create custom crawlers with specialized parsing requirements. All of them use HTTP clients to fetch page content and parsing libraries to extract data from the HTML. Check out the HTTP clients guide to learn about the HTTP clients used by these crawlers, how to switch between them, and how to create custom HTTP clients tailored to your specific requirements.
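
For example, switching the underlying HTTP client is a constructor argument. A minimal sketch, assuming the httpx-based client is available in your installation (it may require an extra; see the HTTP clients guide for details):

from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient

# Use the httpx-based client instead of the default one.
crawler = ParselCrawler(
    http_client=HttpxHttpClient(),
    max_requests_per_crawl=10,
)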

BeautifulSoupCrawler

The BeautifulSoupCrawler uses the BeautifulSoup library for HTML parsing. It provides fault-tolerant parsing that handles malformed HTML, automatic character encoding detection, and supports CSS selectors, tag navigation, and custom search functions. Use this crawler when working with imperfect HTML structures, when you prefer BeautifulSoup's intuitive API, or when prototyping web scraping solutions.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a BeautifulSoupCrawler instance
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using BeautifulSoup
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
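
To illustrate the parsing features mentioned above outside of a crawler, here is a minimal sketch that runs on a hard-coded HTML string (the markup and selectors are made up for illustration); inside a request handler the same calls are available on context.soup:

from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><a class="nav" href="/docs">Docs</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('a.nav'))                             # CSS selectors
print(soup.title.string)                                # tag navigation
print(soup.find_all(lambda tag: tag.has_attr('href')))  # custom search function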

ParselCrawler

The ParselCrawler uses the Parsel library, which provides XPath 1.0 and CSS selector support built on lxml for high performance. It includes built-in regex support for pattern matching, proper XML namespace handling, and offers better performance than BeautifulSoup while maintaining a clean API. Use this crawler when you need XPath functionality, require high-performance parsing, or need to extract data using regular expressions.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Create a ParselCrawler instance
    crawler = ParselCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Extract data using Parsel's XPath and CSS selectors
        data = {
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Enqueue links found on the page for further crawling
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
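
As a quick illustration of the built-in regex support mentioned above, here is a minimal sketch on a hard-coded HTML string (the markup is made up for illustration); inside a request handler the same methods are available on context.selector:

from parsel import Selector

selector = Selector(text='<html><head><title>Crawlee 1.0 docs</title></head></html>')

print(selector.css('title::text').get())                    # CSS selector
print(selector.xpath('//title/text()').get())               # XPath selector
print(selector.css('title::text').re_first(r'(\d+\.\d+)'))  # regex on top of a selector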

HttpCrawler

The HttpCrawler provides direct access to HTTP response body and headers without automatic parsing, offering maximum performance with no parsing overhead. It supports any content type (JSON, XML, binary) and allows complete control over response processing, including memory-efficient handling of large responses. Use this crawler when working with non-HTML content, requiring maximum performance, implementing custom parsing logic, or needing access to raw response data.

import asyncio
import re

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create an HttpCrawler instance - no automatic parsing
    crawler = HttpCrawler(
        # Limit the crawl to 10 requests
        max_requests_per_crawl=10,
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Get the raw response content
        response_body = await context.http_response.read()
        response_text = response_body.decode('utf-8')

        # Extract title manually using regex (since we don't have a parser)
        title_match = re.search(
            r'<title[^>]*>([^<]+)</title>', response_text, re.IGNORECASE
        )
        title = title_match.group(1).strip() if title_match else None

        # Extract basic information
        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push extracted data to the dataset
        await context.push_data(data)

        # Simple link extraction for further crawling
        href_pattern = r'href=["\']([^"\']+)["\']'
        matches = re.findall(href_pattern, response_text, re.IGNORECASE)

        # Enqueue first few links found (limit to avoid too many requests)
        for href in matches[:3]:
            if href.startswith('http') and 'crawlee.dev' in href:
                await context.add_requests([href])

    # Run the crawler
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
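
Because the handler receives the raw body, non-HTML content such as JSON needs no parser at all. A minimal sketch of a JSON-oriented handler, assuming a hypothetical endpoint (https://example.com/api/items is used only for illustration):

import asyncio
import json

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handle_json(context: HttpCrawlingContext) -> None:
        # Decode the raw body as JSON; adjust to whatever the endpoint actually returns.
        payload = json.loads(await context.http_response.read())
        await context.push_data({'url': context.request.url, 'payload': payload})

    # Hypothetical JSON endpoint used only for illustration.
    await crawler.run(['https://example.com/api/items'])


if __name__ == '__main__':
    asyncio.run(main())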

Using custom parsers

Since HttpCrawler provides raw HTTP responses, you can integrate any parsing library. Note that helpers like enqueue_links and extract_links are not available with this approach.

You can integrate popular parsing libraries such as lxml (high-performance parsing with XPath 1.0), lxml with SaxonC-HE (XPath 3.1 support), selectolax (high-speed CSS selectors), PyQuery (jQuery-like syntax), and scrapling (a Scrapy/Parsel-style API offering BeautifulSoup-like methods). The following example demonstrates the approach with lxml.

import asyncio

from lxml import html
from pydantic import ValidationError

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        max_request_retries=1,
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Parse the HTML content using lxml.
        parsed_html = html.fromstring(await context.http_response.read())

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': parsed_html.findtext('.//title'),
            'h1s': [h1.text_content() for h1 in parsed_html.findall('.//h1')],
            'h2s': [h2.text_content() for h2 in parsed_html.findall('.//h2')],
            'h3s': [h3.text_content() for h3 in parsed_html.findall('.//h3')],
        }
        await context.push_data(data)

        # Convert relative URLs to absolute before extracting links.
        parsed_html.make_links_absolute(context.request.url, resolve_base_href=True)

        # XPath 1.0 selector for extracting valid href attributes.
        links_xpath = (
            '//a/@href[not(starts-with(., "#")) '
            'and not(starts-with(., "javascript:")) '
            'and not(starts-with(., "mailto:"))]'
        )

        extracted_requests = []

        # Extract links.
        for url in parsed_html.xpath(links_xpath):
            try:
                request = Request.from_url(url)
            except ValidationError as exc:
                context.log.warning(f'Skipping invalid URL "{url}": {exc}')
                continue
            extracted_requests.append(request)

        # Add extracted requests to the queue with the same-domain strategy.
        await context.add_requests(extracted_requests, strategy='same-domain')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
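
The same pattern applies to the other libraries listed above. As a minimal sketch, here is the equivalent handler using PyQuery instead of lxml (assuming the pyquery package is installed; link extraction works the same way as in the lxml example and is omitted here):

import asyncio

from pyquery import PyQuery

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # Parse the raw response body with PyQuery's jQuery-like API.
        doc = PyQuery((await context.http_response.read()).decode())
        await context.push_data({
            'url': context.request.url,
            'title': doc('title').text(),
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())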

Custom HTTP crawler

While the built-in crawlers cover most use cases, you might need a custom HTTP crawler for specialized parsing requirements. To create a custom HTTP crawler, inherit directly from AbstractHttpCrawler. This approach requires implementing:

  1. Custom parser class: Inherit from AbstractHttpParser.
  2. Custom context class: Define what data and helpers are available to handlers.
  3. Custom crawler class: Tie everything together.

This approach is recommended when you need tight integration between parsing and the crawling context, or when you're building a reusable crawler for a specific technology or format.

The following example demonstrates how to create a custom crawler using selectolax with the Lexbor engine.

Parser implementation

The parser converts HTTP responses into a parsed document and provides methods for element selection. Implement AbstractHttpParser using selectolax with required methods for parsing and querying:

selectolax_parser.py
from __future__ import annotations

import asyncio
from typing import TYPE_CHECKING

from selectolax.lexbor import LexborHTMLParser, LexborNode
from typing_extensions import override

from crawlee.crawlers._abstract_http import AbstractHttpParser

if TYPE_CHECKING:
    from collections.abc import Iterable, Sequence

    from crawlee.http_clients import HttpResponse


class SelectolaxLexborParser(AbstractHttpParser[LexborHTMLParser, LexborNode]):
    """Parser for parsing HTTP response using Selectolax Lexbor."""

    @override
    async def parse(self, response: HttpResponse) -> LexborHTMLParser:
        """Parse HTTP response body into a document object."""
        response_body = await response.read()
        # Run parsing in a thread to avoid blocking the event loop.
        return await asyncio.to_thread(LexborHTMLParser, response_body)

    @override
    async def parse_text(self, text: str) -> LexborHTMLParser:
        """Parse raw HTML string into a document object."""
        return LexborHTMLParser(text)

    @override
    async def select(
        self, parsed_content: LexborHTMLParser, selector: str
    ) -> Sequence[LexborNode]:
        """Select elements matching a CSS selector."""
        return tuple(item for item in parsed_content.css(selector))

    @override
    def is_matching_selector(
        self, parsed_content: LexborHTMLParser, selector: str
    ) -> bool:
        """Check if any element matches the selector."""
        return parsed_content.css_first(selector) is not None

    @override
    def find_links(
        self, parsed_content: LexborHTMLParser, selector: str
    ) -> Iterable[str]:
        """Extract href attributes from elements matching the selector.

        Used by `enqueue_links` helper to discover URLs.
        """
        link: LexborNode
        urls: list[str] = []
        for link in parsed_content.css(selector):
            url = link.attributes.get('href')
            if url:
                urls.append(url.strip())
        return urls

This is enough to use your parser with the AbstractHttpCrawler.create_parsed_http_crawler_class factory method. For more control, continue with the custom context and crawler classes below.
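
As a quick illustration of the factory route, here is a hedged sketch; the static_parser keyword name is an assumption made here for illustration, so check the AbstractHttpCrawler API reference for the exact signature:

from crawlee.crawlers import AbstractHttpCrawler

from .selectolax_parser import SelectolaxLexborParser

# Build a ready-to-use crawler class directly from the parser.
# The keyword name `static_parser` is assumed; consult the API reference.
SelectolaxHttpCrawler = AbstractHttpCrawler.create_parsed_http_crawler_class(
    static_parser=SelectolaxLexborParser(),
)

crawler = SelectolaxHttpCrawler(max_requests_per_crawl=10)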

Crawling context definition (optional)

The crawling context is passed to request handlers and provides access to the parsed content. Extend ParsedHttpCrawlingContext to define the interface your handlers will work with. Here you can implement additional helpers for the crawler context.

selectolax_context.py
from dataclasses import dataclass, fields

from selectolax.lexbor import LexborHTMLParser
from typing_extensions import Self

from crawlee.crawlers._abstract_http import ParsedHttpCrawlingContext


# Custom context for the Selectolax parser. You can add your own methods here
# to facilitate working with the parsed document.
@dataclass(frozen=True)
class SelectolaxLexborContext(ParsedHttpCrawlingContext[LexborHTMLParser]):
    """Crawling context providing access to the parsed page.

    This context is passed to request handlers and includes all standard
    context methods (push_data, enqueue_links, etc.) plus custom helpers.
    """

    @property
    def parser(self) -> LexborHTMLParser:
        """Convenient alias for accessing the parsed document."""
        return self.parsed_content

    @classmethod
    def from_parsed_http_crawling_context(
        cls, context: ParsedHttpCrawlingContext[LexborHTMLParser]
    ) -> Self:
        """Create custom context from the base context.

        Copies all fields from the base context to preserve framework
        functionality while adding a custom interface.
        """
        return cls(
            **{field.name: getattr(context, field.name) for field in fields(context)}
        )

Crawler composition

The crawler class connects the parser and context. Extend AbstractHttpCrawler and configure the context pipeline to use your custom components:

selectolax_crawler.py
from __future__ import annotations

from typing import TYPE_CHECKING

from selectolax.lexbor import LexborHTMLParser, LexborNode

from crawlee.crawlers import AbstractHttpCrawler, HttpCrawlerOptions

from .selectolax_context import SelectolaxLexborContext
from .selectolax_parser import SelectolaxLexborParser

if TYPE_CHECKING:
    from collections.abc import AsyncGenerator

    from typing_extensions import Unpack

    from crawlee.crawlers._abstract_http import ParsedHttpCrawlingContext


# Custom crawler using the custom context. This is optional: you can use
# AbstractHttpCrawler directly with SelectolaxLexborParser if you don't need
# any custom context methods.
class SelectolaxLexborCrawler(
    AbstractHttpCrawler[SelectolaxLexborContext, LexborHTMLParser, LexborNode]
):
    """Custom crawler using Selectolax Lexbor for HTML parsing."""

    def __init__(
        self,
        **kwargs: Unpack[HttpCrawlerOptions[SelectolaxLexborContext]],
    ) -> None:
        # Final step converts the base context to the custom context type.
        async def final_step(
            context: ParsedHttpCrawlingContext[LexborHTMLParser],
        ) -> AsyncGenerator[SelectolaxLexborContext, None]:
            # Yield the custom context, wrapping the base context with
            # additional functionality.
            yield SelectolaxLexborContext.from_parsed_http_crawling_context(context)

        # Build the context pipeline: HTTP request -> parsing -> custom context.
        kwargs['_context_pipeline'] = (
            self._create_static_content_crawler_pipeline().compose(final_step)
        )
        super().__init__(
            parser=SelectolaxLexborParser(),
            **kwargs,
        )

Crawler usage

The custom crawler works like any built-in crawler: request handlers receive your custom context with full access to framework helpers like enqueue_links. The custom parser can also be used with AdaptivePlaywrightCrawler for adaptive crawling.

import asyncio

from .selectolax_crawler import SelectolaxLexborContext, SelectolaxLexborCrawler


async def main() -> None:
    crawler = SelectolaxLexborCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handle_request(context: SelectolaxLexborContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        data = {
            'url': context.request.url,
            'title': context.parser.css_first('title').text(),
        }

        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

This guide provided a comprehensive overview of HTTP crawlers in Crawlee. You learned about the three main crawler types: BeautifulSoupCrawler for fault-tolerant HTML parsing, ParselCrawler for high-performance extraction with XPath and CSS selectors, and HttpCrawler for raw response processing. You also discovered how to integrate third-party parsing libraries with HttpCrawler and how to create fully custom crawlers using AbstractHttpCrawler for specialized parsing requirements.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!