HttpxHttpClient

HTTP client based on the HTTPX library.

This client uses the HTTPX library to perform HTTP requests in crawlers (BasicCrawler subclasses) and to manage sessions, proxies, and error handling.

See the BaseHttpClient class for information common to all HTTP clients.

Usage

from crawlee.http_clients import HttpxHttpClient
from crawlee.http_crawler import HttpCrawler  # or any other HTTP client-based crawler

http_client = HttpxHttpClient()
crawler = HttpCrawler(http_client=http_client)

Methods

__init__

  • __init__(*, persist_cookies_per_session, additional_http_error_status_codes, ignore_http_error_status_codes, http1, http2, verify, header_generator, async_client_kwargs): None
  • A default constructor.


    Parameters

    • persist_cookies_per_session: bool = True (optional, keyword-only)

      Whether to persist cookies per HTTP session.

    • additional_http_error_status_codes: Iterable[int] = () (optional, keyword-only)

      Additional HTTP status codes to treat as errors.

    • ignore_http_error_status_codes: Iterable[int] = () (optional, keyword-only)

      HTTP status codes that should not be treated as errors.

    • http1: bool = True (optional, keyword-only)

      Whether to enable HTTP/1.1 support.

    • http2: bool = True (optional, keyword-only)

      Whether to enable HTTP/2 support.

    • verify: str | bool | SSLContext = True (optional, keyword-only)

      SSL certificates used to verify the identity of requested hosts. In HTTPX, True uses the default CA bundle, False disables verification, and a path to a CA bundle or an SSLContext supplies custom settings.

    • header_generator: HeaderGenerator | None = _DEFAULT_HEADER_GENERATOR (optional, keyword-only)

      Header generator instance to use for generating common headers.

    • async_client_kwargs: Any (optional, keyword-only)

      Additional keyword arguments for httpx.AsyncClient.

    Returns None
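
    The verify parameter accepts an SSLContext, so a custom context can be built with the standard library before the client is created. The following is a minimal sketch, not part of the crawlee documentation; the TLS 1.2 floor and the commented-out CA file path are illustrative assumptions.

```python
import ssl

# Start from the standard library's secure defaults
# (certificate verification and hostname checking are enabled).
ssl_context = ssl.create_default_context()

# Assumption for illustration: refuse anything older than TLS 1.2.
ssl_context.minimum_version = ssl.TLSVersion.TLSv1_2

# Optionally trust a private CA as well; "internal-ca.pem" is a hypothetical path.
# ssl_context.load_verify_locations(cafile="internal-ca.pem")

# The context could then be handed to the client, for example:
# http_client = HttpxHttpClient(verify=ssl_context)
```

    Passing a prepared context keeps TLS policy in one place instead of spreading it across async_client_kwargs.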

crawl

  • async crawl(*, request, session, proxy_info, statistics): HttpCrawlingResult
  • Perform the crawling for a given request.

    This method is called from crawler.run().


    Parameters

    • request: Request (keyword-only)

      The request to be crawled.

    • session: Session | None = None (optional, keyword-only)

      The session associated with the request.

    • proxy_info: ProxyInfo | None = None (optional, keyword-only)

      The information about the proxy to be used.

    • statistics: Statistics | None = None (optional, keyword-only)

      The statistics object to register status codes.

    Returns HttpCrawlingResult

send_request

  • async send_request(*, url, method, headers, payload, session, proxy_info): HttpResponse
  • Send an HTTP request via the client.

    This method is called from the context.send_request() helper.


    Parameters

    • url: str (keyword-only)

      The URL to send the request to.

    • method: HttpMethod = 'GET' (optional, keyword-only)

      The HTTP method to use.

    • headers: HttpHeaders | dict[str, str] | None = None (optional, keyword-only)

      The headers to include in the request.

    • payload: HttpPayload | None = None (optional, keyword-only)

      The data to be sent as the request body.

    • session: Session | None = None (optional, keyword-only)

      The session associated with the request.

    • proxy_info: ProxyInfo | None = None (optional, keyword-only)

      The information about the proxy to be used.

    Returns HttpResponse

Properties

additional_blocked_status_codes

additional_blocked_status_codes: set[int]

ignore_http_error_status_codes

ignore_http_error_status_codes: set[int]
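
Both properties are typed as set[int], which suggests that a response status is classified by membership tests against these sets. The sketch below illustrates one way such a check could combine an "additional" set and an "ignore" set. It is not crawlee's implementation: the helper name is_error_status, the default of treating 4xx and 5xx as errors, and the precedence of the additional set over the ignore set are all assumptions made for illustration.

```python
from collections.abc import Iterable


def is_error_status(
    status_code: int,
    additional_error_codes: Iterable[int] = (),
    ignored_error_codes: Iterable[int] = (),
) -> bool:
    """Hypothetical helper: decide whether a status code counts as an error."""
    additional = set(additional_error_codes)
    ignored = set(ignored_error_codes)

    # Assumption: explicitly added codes are always errors.
    if status_code in additional:
        return True
    # Assumption: explicitly ignored codes are otherwise never errors.
    if status_code in ignored:
        return False
    # Assumption: 4xx and 5xx responses are errors by default.
    return 400 <= status_code <= 599


print(is_error_status(404))                              # True
print(is_error_status(404, ignored_error_codes=[404]))   # False
print(is_error_status(200, additional_error_codes=[200]))  # True
```

Converting the iterables to sets once keeps each lookup constant-time, which matches the set[int] type the properties expose.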