ImpitHttpClient

HTTP client based on the impit library.

This client uses the impit library to perform HTTP requests in crawlers (BasicCrawler subclasses) and to manage sessions, proxies, and error handling.

See the HttpClient class for general information about HTTP clients.

Usage

from crawlee.crawlers import HttpCrawler  # or any other HTTP client-based crawler
from crawlee.http_clients import ImpitHttpClient

http_client = ImpitHttpClient()
crawler = HttpCrawler(http_client=http_client)

Methods

__aenter__

__aexit__

  • async __aexit__(exc_type, exc_value, traceback): None
  • Deinitialize the client and clean up resources when exiting the context manager.


    Parameters

    • exc_type: BaseException | None
    • exc_value: BaseException | None
    • traceback: TracebackType | None

    Returns None
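    The context manager methods above can be exercised with a small sketch like the following. The helper name use_client is hypothetical, and the expectation that active flips to True inside the block and back to False afterwards is an assumption based on the descriptions of __aexit__ and active on this page, not verified behavior.

```python
import asyncio
from typing import Any


async def use_client(client: Any) -> tuple[bool, bool]:
    """Enter the client's context and report `active` inside and after it."""
    async with client:
        inside = client.active  # expected to be True while the context is open
    after = client.active  # expected to be False once __aexit__ has run
    return inside, after


async def main() -> None:
    # Imported here so the helper above can be tried without crawlee installed.
    from crawlee.http_clients import ImpitHttpClient

    print(await use_client(ImpitHttpClient()))


# Run with: asyncio.run(main())
```

    When the client is passed to a crawler, as in the Usage section, the crawler is expected to manage this lifecycle itself, so entering the context by hand should only matter for standalone use.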

__init__

  • __init__(*, persist_cookies_per_session, http3, verify, browser, async_client_kwargs): None
  • Initialize a new instance.


    Parameters

    • persist_cookies_per_session: bool = True (optional, keyword-only)

      Whether to persist cookies per HTTP session.

    • http3: bool = True (optional, keyword-only)

      Whether to enable HTTP/3 support.

    • verify: bool = True (optional, keyword-only)

      Whether to verify SSL certificates of requested hosts.

    • browser: Browser | None = 'firefox' (optional, keyword-only)

      Browser to impersonate.

    • async_client_kwargs: Any

      Additional keyword arguments for impit.AsyncClient.

    Returns None
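    As a rough illustration of the constructor parameters, the sketch below gathers them into a dictionary before building the client. The names client_options, make_client, impersonate, and insecure are hypothetical, and 'chrome' as an accepted browser value is an assumption about the Browser type.

```python
from typing import Any, Optional


def client_options(
    *, impersonate: Optional[str] = 'firefox', insecure: bool = False
) -> dict[str, Any]:
    """Build keyword arguments for ImpitHttpClient from two hypothetical switches."""
    return {
        'persist_cookies_per_session': True,
        'http3': False,  # opt out of HTTP/3; the documented default is True
        'verify': not insecure,  # keep certificate verification on by default
        'browser': impersonate,  # e.g. 'firefox', 'chrome' (assumed), or None
    }


def make_client(**switches: Any) -> Any:
    # Imported lazily so client_options() can be tried without crawlee installed.
    from crawlee.http_clients import ImpitHttpClient

    return ImpitHttpClient(**client_options(**switches))
```

    Calling make_client(impersonate='chrome') would then yield a client that can be passed to a crawler via http_client=..., as in the Usage section.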

cleanup

  • async cleanup(): None

crawl

  • async crawl(request, *, session, proxy_info, statistics): HttpCrawlingResult
  • Perform the crawling for a given request.

    This method is called from crawler.run().


    Parameters

    • request: Request

      The request to be crawled.

    • session: Session | None = None (optional, keyword-only)

      The session associated with the request.

    • proxy_info: ProxyInfo | None = None (optional, keyword-only)

      The information about the proxy to be used.

    • statistics: Statistics | None = None (optional, keyword-only)

      The statistics object to register status codes.

    Returns HttpCrawlingResult

send_request

  • async send_request(url, *, method, headers, payload, session, proxy_info): HttpResponse
  • Send an HTTP request via the client.

    This method is called from context.send_request() helper.


    Parameters

    • url: str

      The URL to send the request to.

    • method: HttpMethod = 'GET' (optional, keyword-only)

      The HTTP method to use.

    • headers: (HttpHeaders | dict[str, str]) | None = None (optional, keyword-only)

      The headers to include in the request.

    • payload: HttpPayload | None = None (optional, keyword-only)

      The data to be sent as the request body.

    • session: Session | None = None (optional, keyword-only)

      The session associated with the request.

    • proxy_info: ProxyInfo | None = None (optional, keyword-only)

      The information about the proxy to be used.

    Returns HttpResponse
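    A minimal sketch of calling send_request directly might look like this. The helper name fetch_status and the example URL are hypothetical, and reading status_code from the returned HttpResponse is an assumption about that type, which this page does not describe.

```python
import asyncio
from typing import Any


async def fetch_status(client: Any, url: str) -> int:
    """Send a GET request with a custom header and return the status code."""
    response = await client.send_request(
        url,
        method='GET',
        headers={'Accept': 'application/json'},
    )
    # `status_code` is assumed to be available on HttpResponse.
    return response.status_code


async def main() -> None:
    # Imported here so fetch_status() can be tried without crawlee installed.
    from crawlee.http_clients import ImpitHttpClient

    client = ImpitHttpClient()
    async with client:
        print(await fetch_status(client, 'https://example.com'))


# Run with: asyncio.run(main())
```

    Inside a request handler, the context.send_request() helper mentioned above is the more usual entry point; the direct call is shown only to make the parameters concrete.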

stream

  • stream(url, *, method, headers, payload, session, proxy_info, timeout): AbstractAsyncContextManager[HttpResponse]
  • Stream an HTTP request via the client.

    This method should be used for downloading potentially large data where you need to process the response body in chunks rather than loading it entirely into memory.


    Parameters

    • url: str

      The URL to send the request to.

    • method: HttpMethod = 'GET' (optional, keyword-only)

      The HTTP method to use.

    • headers: (HttpHeaders | dict[str, str]) | None = None (optional, keyword-only)

      The headers to include in the request.

    • payload: HttpPayload | None = None (optional, keyword-only)

      The data to be sent as the request body.

    • session: Session | None = None (optional, keyword-only)

      The session associated with the request.

    • proxy_info: ProxyInfo | None = None (optional, keyword-only)

      The information about the proxy to be used.

    • timeout: timedelta | None = None (optional, keyword-only)

      The maximum time to wait for establishing the connection.

    Returns AbstractAsyncContextManager[HttpResponse]
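    Processing a body in chunks could be sketched as follows. The helper name count_bytes is hypothetical, and read_stream() as an async iterator of byte chunks on HttpResponse is an assumption, since this page does not document the response type.

```python
import asyncio
from datetime import timedelta
from typing import Any


async def count_bytes(client: Any, url: str) -> int:
    """Stream a response and count its bytes without holding the full body."""
    total = 0
    async with client.stream(url, timeout=timedelta(seconds=30)) as response:
        # `read_stream()` is assumed to yield the body as chunks of bytes.
        async for chunk in response.read_stream():
            total += len(chunk)
    return total


async def main() -> None:
    # Imported here so count_bytes() can be tried without crawlee installed.
    from crawlee.http_clients import ImpitHttpClient

    client = ImpitHttpClient()
    async with client:
        print(await count_bytes(client, 'https://example.com'))


# Run with: asyncio.run(main())
```

    Replacing the running total with a write to an open file would turn the same loop into a download helper while keeping memory use bounded by the chunk size.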

Properties

active

active: bool

Indicate whether the context is active.