
BaseHttpClient

crawlee.http_clients._base.BaseHttpClient

An abstract base class for HTTP clients used in crawlers (BasicCrawler subclasses).

A specific HTTP client should use the _raise_for_error_status_code method to check the status code, which keeps behavior consistent across different HTTP clients. The method raises an HttpStatusCodeError when it encounters an error response, defined by default as any HTTP status code in the range 400 to 599. The error handling is customizable: the user can specify additional status codes to treat as errors, or exclude specific status codes from being considered errors. See the additional_http_error_status_codes and ignore_http_error_status_codes arguments of the constructor.
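The status-code rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the free function raise_for_error_status_code and the HttpStatusCodeError stand-in class are defined here only so the example runs on its own, and the precedence between the two code lists (additional codes win over ignored ones) is an assumption.

```python
from collections.abc import Iterable


class HttpStatusCodeError(Exception):
    """Stand-in for the library's error type, defined here so the sketch is self-contained."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"Error status code returned: {status_code}")
        self.status_code = status_code


def raise_for_error_status_code(
    status_code: int,
    additional_http_error_status_codes: Iterable[int] = (),
    ignore_http_error_status_codes: Iterable[int] = (),
) -> None:
    """Raise HttpStatusCodeError if the status code counts as an error."""
    additional = set(additional_http_error_status_codes)
    ignored = set(ignore_http_error_status_codes)

    # Assumption: codes listed as additional errors always raise.
    if status_code in additional:
        raise HttpStatusCodeError(status_code)

    # Default rule from the description: 400 to 599 is an error unless ignored.
    if 400 <= status_code <= 599 and status_code not in ignored:
        raise HttpStatusCodeError(status_code)


# 404 is ignored, so this call returns without raising.
raise_for_error_status_code(404, ignore_http_error_status_codes=[404])
```

With these settings a 500 response still raises, while a 200 response raises only if 200 is passed in additional_http_error_status_codes.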

Constructors

__init__

  • __init__(*, persist_cookies_per_session, additional_http_error_status_codes, ignore_http_error_status_codes): None
  • Create a new instance.


    Parameters

    • persist_cookies_per_session: bool = True (keyword-only)
    • additional_http_error_status_codes: Iterable[int] = () (keyword-only)
    • ignore_http_error_status_codes: Iterable[int] = () (keyword-only)

    Returns None
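Because every constructor parameter is keyword-only, positional arguments are rejected. The sketch below uses a hypothetical stand-in class, SketchClient, that only mirrors the documented signature; it is not the library's class, and copying the iterables into sets is an assumption made for illustration.

```python
from collections.abc import Iterable


class SketchClient:
    """Hypothetical stand-in that mirrors the documented constructor signature."""

    def __init__(
        self,
        *,
        persist_cookies_per_session: bool = True,
        additional_http_error_status_codes: Iterable[int] = (),
        ignore_http_error_status_codes: Iterable[int] = (),
    ) -> None:
        self.persist_cookies_per_session = persist_cookies_per_session
        # Assumption: copy the iterables once, since they may be one-shot generators.
        self.additional_http_error_status_codes = set(additional_http_error_status_codes)
        self.ignore_http_error_status_codes = set(ignore_http_error_status_codes)


# Arguments must be passed by name.
client = SketchClient(
    additional_http_error_status_codes=[403],
    ignore_http_error_status_codes=[404],
)

# Passing a positional argument raises TypeError.
try:
    SketchClient(False)
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```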

Methods

crawl

  • async crawl(request, *, session, proxy_info, statistics): HttpCrawlingResult
  • Perform the crawling for a given request.

    This method is called from crawler.run().


    Parameters

    • request: Request
    • session: Session | None = None (keyword-only)
    • proxy_info: ProxyInfo | None = None (keyword-only)
    • statistics: Statistics | None = None (keyword-only)

    Returns HttpCrawlingResult

send_request

  • async send_request(url, *, method, headers, query_params, data, session, proxy_info): HttpResponse
  • Send an HTTP request via the client.

    This method is called from the context.send_request() helper.


    Parameters

    • url: str
    • method: HttpMethod = 'GET' (keyword-only)
    • headers: HttpHeaders | None = None (keyword-only)
    • query_params: HttpQueryParams | None = None (keyword-only)
    • data: dict[str, Any] | None = None (keyword-only)
    • session: Session | None = None (keyword-only)
    • proxy_info: ProxyInfo | None = None (keyword-only)

    Returns HttpResponse
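The abstract-base-class pattern with an async method can be sketched as follows. All names here (SketchBaseClient, SketchResponse, CannedClient) are hypothetical and heavily simplified: the real method takes more parameters and returns the library's own HttpResponse type, neither of which is reproduced in this example.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchResponse:
    """Hypothetical simplified response, not the library's HttpResponse."""

    status_code: int
    body: bytes


class SketchBaseClient(ABC):
    """Simplified abstract base: subclasses must implement send_request."""

    @abstractmethod
    async def send_request(self, url: str, *, method: str = "GET") -> SketchResponse:
        """Send a request and return a response."""


class CannedClient(SketchBaseClient):
    """Toy subclass that returns a fixed response, so no network access is needed."""

    async def send_request(self, url: str, *, method: str = "GET") -> SketchResponse:
        return SketchResponse(status_code=200, body=f"{method} {url}".encode())


# The async method is awaited; asyncio.run drives it from synchronous code.
response = asyncio.run(CannedClient().send_request("https://example.com"))
```

Instantiating SketchBaseClient directly raises TypeError, because its abstract method has no implementation.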