HTTP clients
HTTP clients are used by the HTTP-based crawlers (e.g. BeautifulSoupCrawler) to communicate with web servers. They rely on external HTTP libraries such as httpx, aiohttp, or curl-cffi rather than on a browser. After the page content is retrieved, an HTML parsing library such as beautifulsoup, parsel, selectolax, or pyquery is typically used to extract data. These crawlers are faster than browser-based crawlers, but they cannot execute client-side JavaScript.
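To illustrate the split between the HTTP library and the parsing library, here is a minimal standalone sketch (outside Crawlee, assuming httpx and beautifulsoup4 are installed) that downloads a page with httpx and extracts its title with BeautifulSoup:

import asyncio

import httpx
from bs4 import BeautifulSoup


async def main() -> None:
    # Fetch the page with a plain HTTP client; no browser and no JavaScript execution.
    async with httpx.AsyncClient(follow_redirects=True, timeout=10) as client:
        response = await client.get('https://crawlee.dev')

    # Hand the raw HTML to a parsing library to extract data.
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title.string if soup.title else None)


if __name__ == '__main__':
    asyncio.run(main())

The HTTP-based crawlers shown below wrap this same combination for you, adding a request queue, link enqueueing, and data storage on top.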
How to switch between HTTP clients
Crawlee currently provides two HTTP clients: HttpxHttpClient, which uses the httpx library, and CurlImpersonateHttpClient, which uses the curl-cffi library. You can switch between them by passing the http_client parameter when creating a crawler. The default HTTP client is HttpxHttpClient. Below are examples of how to set the HTTP client for the BeautifulSoupCrawler.
- BeautifulSoupCrawler with HTTPX

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    http_client = HttpxHttpClient(
        # Optional additional keyword arguments for `httpx.AsyncClient`.
        timeout=10,
        follow_redirects=True,
    )

    crawler = BeautifulSoupCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
- BeautifulSoupCrawler with Curl impersonate

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient


async def main() -> None:
    http_client = CurlImpersonateHttpClient(
        # Optional additional keyword arguments for `curl_cffi.requests.AsyncSession`.
        timeout=10,
        impersonate='chrome124',
    )

    crawler = BeautifulSoupCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
Installation
Since HttpxHttpClient is the default HTTP client, you don't need to install any additional packages to use it. If you want to use CurlImpersonateHttpClient, you need to install crawlee with the curl-impersonate extra:
pip install 'crawlee[curl-impersonate]'
or install all available extras:
pip install 'crawlee[all]'
How HTTP clients work
We provide an abstract base class, BaseHttpClient, which defines the interface that all HTTP clients must implement. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies, and they expose the methods that crawlers call to fetch pages. To implement your own HTTP client, inherit from the BaseHttpClient class and implement the required methods.
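As a starting point, the skeleton below shows the general shape of such a subclass. It is only a sketch: the abstract method names used here (crawl and send_request) and their parameters are assumptions based on typical usage, so consult the BaseHttpClient API reference for the exact signatures required by your Crawlee version.

from crawlee.http_clients import BaseHttpClient


class MyHttpClient(BaseHttpClient):
    """A sketch of a custom HTTP client backed by an HTTP library of your choice."""

    async def crawl(self, request, *, session=None, proxy_info=None, statistics=None):
        # Called by crawlers for each request in the queue. Send `request` with your
        # HTTP library, honoring the provided session (cookies), proxy, and statistics,
        # and return the crawling result type expected by the interface.
        # Method name and parameters are assumptions; check the API reference.
        raise NotImplementedError

    async def send_request(self, url, *, method='GET', headers=None, session=None, proxy_info=None):
        # Called for one-off requests outside the main crawling loop.
        # Return a response object compatible with the HTTP client interface.
        # Method name and parameters are assumptions; check the API reference.
        raise NotImplementedError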