HTTP clients
HTTP clients are used by HTTP-based crawlers (e.g., ParselCrawler and BeautifulSoupCrawler) to communicate with web servers. They rely on external HTTP libraries rather than a browser. Examples of such libraries include httpx, aiohttp, curl-cffi, and impit. After retrieving page content, an HTML parsing library is typically used to facilitate data extraction. Examples of such libraries include beautifulsoup, parsel, selectolax, and pyquery. These crawlers are faster than browser-based crawlers but cannot execute client-side JavaScript.
Switching between HTTP clients
Crawlee currently provides three main HTTP clients: HttpxHttpClient, which uses the httpx library; CurlImpersonateHttpClient, which uses the curl-cffi library; and ImpitHttpClient, which uses the impit library. You can switch between them by setting the http_client parameter when initializing a crawler class. The default HTTP client is HttpxHttpClient.
Below are examples of how to configure the HTTP client for the ParselCrawler:
ParselCrawler with HTTPX:
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    http_client = HttpxHttpClient(
        # Optional additional keyword arguments for `httpx.AsyncClient`.
        timeout=10,
        follow_redirects=True,
    )

    crawler = ParselCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
ParselCrawler with curl-cffi:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient


async def main() -> None:
    http_client = CurlImpersonateHttpClient(
        # Optional additional keyword arguments for `curl_cffi.requests.AsyncSession`.
        timeout=10,
        impersonate='chrome131',
    )

    crawler = ParselCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
ParselCrawler with impit:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import ImpitHttpClient


async def main() -> None:
    http_client = ImpitHttpClient(
        # Optional additional keyword arguments for `impit.AsyncClient`.
        http3=True,
        browser='firefox',
        verify=True,
    )

    crawler = ParselCrawler(
        http_client=http_client,
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links from the page.
        await context.enqueue_links()

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
Installation requirements
Since HttpxHttpClient is the default HTTP client, it's included with the base Crawlee installation and requires no additional packages.
For CurlImpersonateHttpClient, you need to install Crawlee with the curl-impersonate extra:
python -m pip install 'crawlee[curl-impersonate]'
For ImpitHttpClient, you need to install Crawlee with the impit extra:
python -m pip install 'crawlee[impit]'
Alternatively, you can install all available extras to get access to all HTTP clients and features:
python -m pip install 'crawlee[all]'
Creating custom HTTP clients
Crawlee provides an abstract base class, HttpClient, which defines the interface that all HTTP clients must implement. This allows you to create custom HTTP clients tailored to your specific requirements.
HTTP clients are responsible for several key operations:
- sending HTTP requests and receiving responses,
- managing cookies and sessions,
- handling headers and authentication,
- managing proxy configurations,
- connection pooling with timeout management.
To create a custom HTTP client, you need to inherit from the HttpClient base class and implement all required abstract methods. Your implementation must be async-compatible and include proper cleanup and resource management to work seamlessly with Crawlee's concurrent processing model.
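As a rough illustration, here is a minimal sketch of what a custom client backed by httpx might look like. The method names crawl, send_request, and cleanup, together with their signatures, are assumptions made for this example rather than a verbatim copy of Crawlee's interface; always check the HttpClient API reference for the exact abstract methods your Crawlee version defines.

# A minimal sketch of a custom HTTP client, built against a hypothetical interface.
# NOTE: The method names and signatures below (`crawl`, `send_request`, `cleanup`)
# are illustrative assumptions; consult the `HttpClient` API reference for the
# exact abstract methods required by your Crawlee version.
import httpx

from crawlee.http_clients import HttpClient


class MyHttpxClient(HttpClient):
    """Hypothetical custom client that routes all traffic through one shared httpx session."""

    def __init__(self) -> None:
        super().__init__()  # Assumes the base class accepts a no-argument call.
        self._client = httpx.AsyncClient(
            headers={'Accept-Language': 'en-US,en;q=0.9'},
            timeout=30,
        )

    async def crawl(self, request, *, session=None, proxy_info=None, statistics=None):
        # Fetch the page for a crawler-managed request object.
        response = await self._client.get(request.url)
        ...  # Wrap `response` in the crawling-result type your Crawlee version expects.

    async def send_request(self, url, *, method='GET', headers=None, payload=None, session=None, proxy_info=None):
        # Lower-level call used for one-off requests outside the main crawling loop.
        response = await self._client.request(method, url, headers=headers, content=payload)
        ...  # Wrap `response` in the expected HTTP-response type.

    async def cleanup(self) -> None:
        # Release pooled connections when the crawler shuts down.
        await self._client.aclose()

Once the interface is implemented, an instance of such a class can be passed to a crawler through the same http_client parameter shown in the examples above.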
Conclusion
This guide introduced you to the HTTP clients available in Crawlee and demonstrated how to switch between them, including their installation requirements and usage examples. You also learned about the responsibilities of HTTP clients and how to implement your own custom HTTP client by inheriting from the HttpClient base class.
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!