CurlImpersonateHttpClient

HTTP client based on the curl-cffi library.

This client uses the curl-cffi library to perform HTTP requests in crawlers (BasicCrawler subclasses) and to manage sessions, proxies, and error handling.

See the HttpClient class for general information that applies to all HTTP clients.

Usage

from crawlee.crawlers import HttpCrawler  # or any other HTTP client-based crawler
from crawlee.http_clients import CurlImpersonateHttpClient

http_client = CurlImpersonateHttpClient()
crawler = HttpCrawler(http_client=http_client)

Hierarchy

HttpClient
- CurlImpersonateHttpClient

Methods

__aenter__

async __aenter__(): HttpClient

Inherited from HttpClient.__aenter__
Initialize the client when entering the context manager.
Returns HttpClient

__aexit__

async __aexit__(exc_type, exc_value, traceback): None

Inherited from HttpClient.__aexit__
Deinitialize the client and clean up resources when exiting the context manager.
Parameters
- exc_type: BaseException | None
- exc_value: BaseException | None
- traceback: TracebackType | None
Returns None

__init__

__init__(*, persist_cookies_per_session, **async_session_kwargs): None

Overrides HttpClient.__init__
Initialize a new instance.
Parameters
- persist_cookies_per_session (optional, keyword-only): bool = True
  Whether to persist cookies per HTTP session.
- async_session_kwargs: Any
  Additional keyword arguments passed through to curl_cffi.requests.AsyncSession.
Returns None
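
A minimal sketch of passing extra session options through the constructor. The option names `impersonate` and `timeout` are assumptions about what curl_cffi.requests.AsyncSession accepts in the installed version, and `SESSION_KWARGS` and `make_client` are hypothetical names used only for illustration.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    from crawlee.http_clients import CurlImpersonateHttpClient

# Options forwarded unchanged to curl_cffi.requests.AsyncSession.
# Both keys are assumptions; check them against your curl-cffi version.
SESSION_KWARGS: dict[str, Any] = {
    'impersonate': 'chrome',  # browser fingerprint curl-cffi should mimic
    'timeout': 10,            # request timeout in seconds
}


def make_client() -> 'CurlImpersonateHttpClient':
    """Build a client that keeps a separate cookie jar per session."""
    # Imported lazily because the client depends on the optional
    # curl-cffi package being installed.
    from crawlee.http_clients import CurlImpersonateHttpClient

    return CurlImpersonateHttpClient(
        persist_cookies_per_session=True,
        **SESSION_KWARGS,
    )
```

The returned client can be handed to a crawler in the same way as in the Usage section, for example `HttpCrawler(http_client=make_client())`.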

cleanup

async cleanup(): None

Overrides HttpClient.cleanup
Clean up resources used by the client.

This method is called when the client is no longer needed and should be overridden in subclasses to perform any necessary cleanup such as closing connections, releasing file handles, or other resource deallocation.
Returns None

crawl

async crawl(request, *, session, proxy_info, statistics): HttpCrawlingResult

Overrides HttpClient.crawl
Perform the crawling for a given request.

This method is called from crawler.run().
Parameters
- request: Request
  The request to be crawled.
- session (optional, keyword-only): Session | None = None
  The session associated with the request.
- proxy_info (optional, keyword-only): ProxyInfo | None = None
  The information about the proxy to be used.
- statistics (optional, keyword-only): Statistics | None = None
  The statistics object to register status codes.
Returns HttpCrawlingResult

send_request

async send_request(url, *, method, headers, payload, session, proxy_info): HttpResponse

Overrides HttpClient.send_request
Send an HTTP request via the client.

This method is called from the context.send_request() helper.
Parameters
- url: str
  The URL to send the request to.
- method (optional, keyword-only): HttpMethod = 'GET'
  The HTTP method to use.
- headers (optional, keyword-only): (HttpHeaders | dict[str, str]) | None = None
  The headers to include in the request.
- payload (optional, keyword-only): HttpPayload | None = None
  The data to be sent as the request body.
- session (optional, keyword-only): Session | None = None
  The session associated with the request.
- proxy_info (optional, keyword-only): ProxyInfo | None = None
  The information about the proxy to be used.
Returns HttpResponse
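
The call shape described above can be sketched as follows. This assumes the returned HttpResponse exposes a `status_code` attribute and that the client passed in is ready to use; `fetch_status` and the `Accept` header value are hypothetical and chosen only for illustration.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from crawlee.http_clients import HttpClient


async def fetch_status(client: 'HttpClient', url: str) -> int:
    """Send a GET request with one custom header and return the status code."""
    response = await client.send_request(
        url,
        method='GET',
        headers={'Accept': 'application/json'},
    )
    # Assumption: HttpResponse provides the numeric status as `status_code`.
    return response.status_code
```

Inside a request handler the same request would normally go through the context.send_request() helper, which supplies the session and proxy arguments itself.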

stream

stream(url, *, method, headers, payload, session, proxy_info, timeout): AbstractAsyncContextManager[HttpResponse]

Overrides HttpClient.stream
Stream an HTTP request via the client.

This method should be used for downloading potentially large data where you need to process the response body in chunks rather than loading it entirely into memory.
Parameters
- url: str
  The URL to send the request to.
- method (optional, keyword-only): HttpMethod = 'GET'
  The HTTP method to use.
- headers (optional, keyword-only): (HttpHeaders | dict[str, str]) | None = None
  The headers to include in the request.
- payload (optional, keyword-only): HttpPayload | None = None
  The data to be sent as the request body.
- session (optional, keyword-only): Session | None = None
  The session associated with the request.
- proxy_info (optional, keyword-only): ProxyInfo | None = None
  The information about the proxy to be used.
- timeout (optional, keyword-only): timedelta | None = None
  The maximum time to wait for establishing the connection.
Returns AbstractAsyncContextManager[HttpResponse]
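
A rough sketch of chunked downloading with stream(). It assumes that the yielded HttpResponse offers a `read_stream()` method returning an async iterator of `bytes`, which is not stated on this page and should be verified against the HttpResponse reference. `download_to_file` and the 30-second timeout are hypothetical.

```python
from datetime import timedelta
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from crawlee.http_clients import HttpClient


async def download_to_file(client: 'HttpClient', url: str, path: Path) -> int:
    """Stream the response body to `path` and return the bytes written."""
    written = 0
    # stream() returns an async context manager, so the connection is
    # released when the block exits, even if writing fails.
    async with client.stream(url, timeout=timedelta(seconds=30)) as response:
        with path.open('wb') as file:
            # Assumption: read_stream() yields the body in bytes chunks.
            async for chunk in response.read_stream():
                file.write(chunk)
                written += len(chunk)
    return written
```

Because each chunk is written as soon as it arrives, memory use stays close to the size of a single chunk instead of the whole body.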

Properties

active

active: bool

Indicate whether the context is active.
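
To illustrate how `active` relates to `__aenter__` and `__aexit__`, the sketch below enters the client's context and records the flag inside and after the block. It assumes the flag is true only while the context is entered; `check_lifecycle` is a hypothetical helper.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from crawlee.http_clients import HttpClient


async def check_lifecycle(client: 'HttpClient') -> tuple[bool, bool]:
    """Return the value of `active` inside and after the context block."""
    # __aenter__ initializes the client; __aexit__ cleans it up.
    async with client:
        active_inside = client.active
    return active_inside, client.active
```

This is only relevant when the client is used on its own; a crawler that receives the client through `http_client=` is expected to manage the context itself.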
