
AbstractHttpCrawler

A web crawler for performing HTTP requests.

The AbstractHttpCrawler builds on top of the BasicCrawler, inheriting all its features. Additionally, it implements HTTP communication using HTTP clients. The class allows integration with any HTTP client that implements the HttpClient interface, provided as an input parameter to the constructor.

AbstractHttpCrawler is a generic class intended to be used with a specific parser for parsing HTTP responses and the expected type of TCrawlingContext available to the user function. Examples of specific versions include BeautifulSoupCrawler, ParselCrawler, and HttpCrawler.

HTTP client-based crawlers are ideal for websites that do not require JavaScript execution. For websites that require client-side JavaScript execution, consider using a browser-based crawler like the PlaywrightCrawler.


Methods

__init__

  • __init__(*, configuration, event_manager, storage_client, request_manager, session_pool, proxy_configuration, http_client, request_handler, max_request_retries, max_requests_per_crawl, max_session_rotations, max_crawl_depth, use_session_pool, retry_on_blocked, additional_http_error_status_codes, ignore_http_error_status_codes, concurrency_settings, request_handler_timeout, statistics, abort_on_error, keep_alive, configure_logging, statistics_log_format, respect_robots_txt_file, status_message_logging_interval, status_message_callback, _context_pipeline, _additional_context_managers, _logger): None
  • Initialize a new instance.


    Parameters

    • optional, keyword-only configuration: Configuration | None = None

      The Configuration instance. Some of its properties are used as defaults for the crawler.

    • optional, keyword-only event_manager: EventManager | None = None

      The event manager for managing events for the crawler and all its components.

    • optional, keyword-only storage_client: StorageClient | None = None

      The storage client for managing storages for the crawler and all its components.

    • optional, keyword-only request_manager: RequestManager | None = None

      Manager of requests that should be processed by the crawler.

    • optional, keyword-only session_pool: SessionPool | None = None

      A custom SessionPool instance, allowing the use of non-default configuration.

    • optional, keyword-only proxy_configuration: ProxyConfiguration | None = None

      HTTP proxy configuration used when making requests.

    • optional, keyword-only http_client: HttpClient | None = None

      HTTP client used by BasicCrawlingContext.send_request method.

    • optional, keyword-only request_handler: Callable[[TCrawlingContext], Awaitable[None]] | None = None

      A callable responsible for handling requests.

    • optional, keyword-only max_request_retries: int = 3

      Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).

    • optional, keyword-only max_requests_per_crawl: int | None = None

      Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value. If used together with keep_alive, the crawler will be kept alive only until max_requests_per_crawl is reached.

    • optional, keyword-only max_session_rotations: int = 10

      Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.

    • optional, keyword-only max_crawl_depth: int | None = None

      Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

    • optional, keyword-only use_session_pool: bool = True

      Enable the use of a session pool for managing sessions during crawling.

    • optional, keyword-only retry_on_blocked: bool = True

      If True, the crawler attempts to bypass bot protections automatically.

    • optional, keyword-only additional_http_error_status_codes: Iterable[int] | None = None

      Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

    • optional, keyword-only ignore_http_error_status_codes: Iterable[int] | None = None

      HTTP status codes that are typically considered errors but should be treated as successful responses.

    • optional, keyword-only concurrency_settings: ConcurrencySettings | None = None

      Settings to fine-tune concurrency levels.

    • optional, keyword-only request_handler_timeout: timedelta = timedelta(minutes=1)

      Maximum duration allowed for a single request handler to run.

    • optional, keyword-only statistics: Statistics[TStatisticsState] | None = None

      A custom Statistics instance, allowing the use of non-default configuration.

    • optional, keyword-only abort_on_error: bool = False

      If True, the crawler stops immediately when any request handler error occurs.

    • optional, keyword-only keep_alive: bool = False

      If True, the crawler is kept alive even if there are no requests in the queue. Use crawler.stop() to exit the crawler.

    • optional, keyword-only configure_logging: bool = True

      If True, the crawler will set up logging infrastructure automatically.

    • optional, keyword-only statistics_log_format: Literal['table', 'inline'] = 'table'

      If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.

    • optional, keyword-only respect_robots_txt_file: bool = False

      If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that it disallows. This also prevents disallowed URLs from being added via EnqueueLinksFunction.

    • optional, keyword-only status_message_logging_interval: timedelta = timedelta(seconds=10)

      Interval for logging the crawler status messages.

    • optional, keyword-only status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] | None = None

      Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.

    • optional, keyword-only _context_pipeline: ContextPipeline[TCrawlingContext] | None = None

      Enables extending the request lifecycle and modifying the crawling context. Intended for use by subclasses rather than direct instantiation of BasicCrawler.

    • optional, keyword-only _additional_context_managers: Sequence[AbstractAsyncContextManager] | None = None

      Additional context managers used throughout the crawler lifecycle. Intended for use by subclasses rather than direct instantiation of BasicCrawler.

    • optional, keyword-only _logger: logging.Logger | None = None

      A logger instance, typically provided by a subclass, for consistent logging labels. Intended for use by subclasses rather than direct instantiation of BasicCrawler.

    Returns None

add_requests

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request manager in batches.


    Parameters

    • requests: Sequence[str | Request]

      A list of requests to add to the queue.

    • optional, keyword-only forefront: bool = False

      If True, add requests to the forefront of the queue.

    • optional, keyword-only batch_size: int = 1000

      The number of requests to add in one batch.

    • optional, keyword-only wait_time_between_batches: timedelta = timedelta(0)

      Time to wait between adding batches.

    • optional, keyword-only wait_for_all_requests_to_be_added: bool = False

      If True, wait for all requests to be added before returning.

    • optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

      Timeout for waiting for all requests to be added.

    Returns None

create_parsed_http_crawler_class

error_handler

export_data

  • async export_data(path, dataset_id, dataset_name): None
  • Export all items from a Dataset to a JSON or CSV file.

    This method simplifies the process of exporting data collected during crawling. It automatically determines the export format based on the file extension (.json or .csv) and handles the conversion of Dataset items to the appropriate format.


    Parameters

    • path: str | Path

      The destination file path. Must end with '.json' or '.csv'.

    • optional dataset_id: str | None = None

      The ID of the Dataset to export from. If None, the dataset_name parameter is used instead.

    • optional dataset_name: str | None = None

      The name of the Dataset to export from. If None, the dataset_id parameter is used instead.

    Returns None

failed_request_handler

get_data

  • async get_data(dataset_id, dataset_name, *, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden, flatten, view): DatasetItemsListPage
  • Retrieve data from a Dataset.

    This helper method simplifies the process of retrieving data from a Dataset. It opens the specified Dataset and then retrieves the data based on the provided parameters.


    Parameters

    • optional dataset_id: str | None = None

      The ID of the Dataset.

    • optional dataset_name: str | None = None

      The name of the Dataset.

    • optional, keyword-only offset: int

      Skips the specified number of items at the start.

    • optional, keyword-only limit: int | None

      The maximum number of items to retrieve. Unlimited if None.

    • optional, keyword-only clean: bool

      Returns only non-empty items and excludes hidden fields. A shortcut for skip_hidden and skip_empty.

    • optional, keyword-only desc: bool

      Set to True to sort results in descending order.

    • optional, keyword-only fields: list[str]

      Fields to include in each item. Sorts fields as specified if provided.

    • optional, keyword-only omit: list[str]

      Fields to exclude from each item.

    • optional, keyword-only unwind: str

      Unwinds items by a specified array field, turning each element into a separate item.

    • optional, keyword-only skip_empty: bool

      Excludes empty items from the results if True.

    • optional, keyword-only skip_hidden: bool

      Excludes fields starting with '#' if True.

    • optional, keyword-only flatten: list[str]

      Fields to be flattened in returned items.

    • optional, keyword-only view: str

      Specifies the dataset view to be used.

    Returns DatasetItemsListPage

get_dataset

  • async get_dataset(*, id, name): Dataset
  • Return the Dataset with the given ID or name. If none is provided, return the default one.


    Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None

    Returns Dataset

get_key_value_store

  • async get_key_value_store(*, id, name): KeyValueStore
  • Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.


    Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None

    Returns KeyValueStore

get_request_manager

on_skipped_request

pre_navigation_hook

  • pre_navigation_hook(hook): None
  • Register a hook to be called before each navigation.


    Parameters

    • hook: Callable[[BasicCrawlingContext], Awaitable[None]]

      A coroutine function to be called before each navigation.

    Returns None

run

  • async run(requests, *, purge_request_queue): FinalStatistics
  • Run the crawler until all requests are processed.


    Parameters

    • optional requests: Sequence[str | Request] | None = None

      The requests to be enqueued before the crawler starts.

    • optional, keyword-only purge_request_queue: bool = True

      If this is True and the crawler is not being run for the first time, the default request queue will be purged.

    Returns FinalStatistics

stop

  • stop(reason): None
  • Set a flag to stop the crawler.

    This stops the current crawler run regardless of whether all requests have been finished.


    Parameters

    • optional reason: str = 'Stop was called externally.'

      Reason for stopping that will be used in logs.

    Returns None

Properties

log

log: logging.Logger

The logger used by the crawler.

router

The Router used to handle each individual crawling request.

statistics

Statistics about the current (or last) crawler run.