HTTP clients

HTTP clients are used by the HTTP-based crawlers (e.g. BeautifulSoupCrawler) to communicate with web servers. They rely on external HTTP libraries such as httpx, aiohttp, or curl-cffi rather than on a browser. After the page content is retrieved, an HTML parsing library such as beautifulsoup, parsel, selectolax, or pyquery is typically used to extract data. These crawlers are faster than browser-based crawlers, but they cannot execute client-side JavaScript.

How to switch between HTTP clients

Crawlee currently provides two HTTP clients: HttpxHttpClient, which uses the httpx library, and CurlImpersonateHttpClient, which uses the curl-cffi library. You can switch between them by setting the http_client parameter when initializing the crawler. The default HTTP client is HttpxHttpClient. Below are examples of how to set the HTTP client for the BeautifulSoupCrawler.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    http_client = HttpxHttpClient(
        # Optional additional keyword arguments for `httpx.AsyncClient`.
        timeout=10,
        follow_redirects=True,
    )

    crawler = BeautifulSoupCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

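For comparison, here is a minimal sketch of the same crawler using CurlImpersonateHttpClient instead (it requires the curl-impersonate extra described in the Installation section below). The import path and the keyword arguments passed to the client, such as impersonate, are assumptions based on the curl-cffi AsyncSession API; check the current crawlee.http_clients reference for the exact interface.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
# Assumed import path; verify it against your installed crawlee version.
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient


async def main() -> None:
    http_client = CurlImpersonateHttpClient(
        # Optional additional keyword arguments for `curl_cffi.requests.AsyncSession`.
        # The browser fingerprint to impersonate is an illustrative value.
        timeout=10,
        impersonate='chrome124',
    )

    crawler = BeautifulSoupCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
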
Installation

Since HttpxHttpClient is the default HTTP client, you don't need to install additional packages to use it. If you want to use CurlImpersonateHttpClient, you need to install crawlee with the curl-impersonate extra.

pip install 'crawlee[curl-impersonate]'

or install all available extras:

pip install 'crawlee[all]'

How HTTP clients work

We provide an abstract base class, BaseHttpClient, which defines the interface that all HTTP clients must implement. HTTP clients are responsible for sending requests and receiving responses, as well as for managing cookies, headers, and proxies. They expose the methods that crawlers call to fetch pages. To implement your own HTTP client, inherit from the BaseHttpClient class and implement the required methods, as sketched below.
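
The following skeleton is a rough sketch of such a subclass. The method names (crawl and send_request) and their simplified signatures, as well as the exported names HttpCrawlingResult and HttpResponse, are assumptions; consult the BaseHttpClient source in crawlee.http_clients for the authoritative interface before implementing a real client.

from __future__ import annotations

# Assumed exports; verify against your installed crawlee version.
from crawlee.http_clients import BaseHttpClient, HttpCrawlingResult, HttpResponse


class MyHttpClient(BaseHttpClient):
    # A hypothetical HTTP client backed by an HTTP library of your choice.

    async def crawl(self, request, *, session=None, proxy_info=None, statistics=None) -> HttpCrawlingResult:
        # Called by crawlers for each request taken from the queue. Send the
        # request with your HTTP library, honoring the session's cookies and
        # the provided proxy, then wrap the result in an HttpCrawlingResult.
        raise NotImplementedError

    async def send_request(self, url, *, method='GET', headers=None, payload=None, session=None, proxy_info=None) -> HttpResponse:
        # Called for one-off requests made outside the main crawling loop.
        # Return an object that satisfies the HttpResponse interface.
        raise NotImplementedError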