AdaptivePlaywrightCrawler
Hierarchy
- BasicCrawler
  - AdaptivePlaywrightCrawler
 
Index
Methods
- __init__
- add_requests
- error_handler
- export_data
- failed_request_handler
- get_data
- get_dataset
- get_key_value_store
- get_request_manager
- on_skipped_request
- pre_navigation_hook
- run
- stop
- track_browser_request_handler_runs
- track_http_only_request_handler_runs
- track_rendering_type_mispredictions
- with_beautifulsoup_static_parser
- with_parsel_static_parser
Properties
- log
- router
- statistics
Methods
__init__
- Initialize a new instance. The recommended way to create an instance is through the factory methods with_beautifulsoup_static_parser or with_parsel_static_parser.
- Parameters
- keyword-only static_parser: AbstractHttpParser[TStaticParseResult, TStaticSelectResult] - Implementation of AbstractHttpParser. Parser that will be used for static crawling.
- optional, keyword-only rendering_type_predictor: RenderingTypePredictor | None = None - Object that implements RenderingTypePredictor and is capable of predicting which rendering method should be used. If None, then DefaultRenderingTypePredictor is used.
- optional, keyword-only result_checker: Callable[[RequestHandlerRunResult], bool] | None = None - Function that evaluates whether a crawling result is valid or not.
- optional, keyword-only result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None - Function that compares two crawling results and decides whether they are equivalent.
- optional, keyword-only playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None - PlaywrightCrawler-only kwargs that are passed to the sub crawler.
- optional, keyword-only statistics: Statistics[AdaptivePlaywrightCrawlerStatisticState] | None = None - A custom Statistics[AdaptivePlaywrightCrawlerStatisticState] instance, allowing the use of non-default configuration.
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns None
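Although __init__ can be called directly with a custom static parser, the factory methods are the intended entry points. A minimal construction sketch (the option values and the headless flag are illustrative, not required defaults):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

# Recommended: construct via a factory method rather than calling __init__ directly.
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    max_requests_per_crawl=10,  # keyword-only option described above
    playwright_crawler_specific_kwargs={'headless': True},  # forwarded to the Playwright sub crawler
)
```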
add_requests
- Add requests to the underlying request manager in batches.
- Parameters
- requests: Sequence[str | Request] - A list of requests to add to the queue.
- optional, keyword-only forefront: bool = False - If True, add requests to the forefront of the queue.
- optional, keyword-only batch_size: int = 1000 - The number of requests to add in one batch.
- optional, keyword-only wait_time_between_batches: timedelta = timedelta(0) - Time to wait between adding batches.
- optional, keyword-only wait_for_all_requests_to_be_added: bool = False - If True, wait for all requests to be added before returning.
- optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None - Timeout for waiting for all requests to be added.
 - Returns None
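A short sketch of batched enqueuing, assuming `crawler` was created as shown earlier and that this code runs inside an async function (the URLs are placeholders):

```python
await crawler.add_requests(
    ['https://crawlee.dev', 'https://crawlee.dev/docs'],  # placeholder URLs
    batch_size=500,                          # submit requests in batches of 500
    wait_for_all_requests_to_be_added=True,  # block until every batch has been submitted
)
```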
error_handler
- Register a function to handle errors occurring in request handlers.
- The error handler is invoked after a request handler error occurs and before a retry attempt.
- Parameters
- handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
 - Returns ErrorHandler[TCrawlingContext]
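An illustrative registration of an error handler. The handler receives the crawling context and the raised exception and may return a modified Request to use for the retry; the import paths assume the current crawlee.crawlers layout:

```python
from crawlee import Request
from crawlee.crawlers import AdaptivePlaywrightCrawler, BasicCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()

@crawler.error_handler
async def retry_logger(context: BasicCrawlingContext, error: Exception) -> Request | None:
    # Invoked after a request handler error, before the retry attempt.
    context.log.warning(f'Retrying {context.request.url} after error: {error}')
    return None  # returning None keeps the original request for the retry
```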
export_data
- Export all items from a Dataset to a JSON or CSV file.
- This method simplifies the process of exporting data collected during crawling. It automatically determines the export format based on the file extension (.json or .csv) and handles the conversion of Dataset items to the appropriate format.
- Parameters
- path: str | Path - The destination file path. Must end with '.json' or '.csv'.
- optional dataset_id: str | None = None - The ID of the Dataset to export from.
- optional dataset_name: str | None = None - The name of the Dataset to export from (global scope, named storage).
- optional dataset_alias: str | None = None - The alias of the Dataset to export from (run scope, unnamed storage).
 - Returns None
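A sketch of exporting the default dataset after a run; the format is inferred from the file extension and the file names are placeholders (run inside an async function):

```python
await crawler.export_data('results.json')  # exports Dataset items as JSON
await crawler.export_data('results.csv')   # exports the same items as CSV
```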
failed_request_handler
- Register a function to handle requests that exceed the maximum retry limit.
- The failed request handler is invoked when a request has failed all retry attempts.
- Parameters
- handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
 - Returns FailedRequestHandler[TCrawlingContext]
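An illustrative failed-request handler, registered the same way as the error handler above (the `crawler` instance and import path are assumed as before):

```python
from crawlee.crawlers import BasicCrawlingContext

@crawler.failed_request_handler
async def on_failed(context: BasicCrawlingContext, error: Exception) -> None:
    # Invoked once a request has exhausted all retry attempts.
    context.log.error(f'Giving up on {context.request.url}: {error}')
```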
get_data
- Retrieve data from a Dataset.
- This helper method simplifies the process of retrieving data from a Dataset. It opens the specified one and then retrieves the data based on the provided parameters.
- Parameters
- optional dataset_id: str | None = None - The ID of the Dataset.
- optional dataset_name: str | None = None - The name of the Dataset (global scope, named storage).
- optional dataset_alias: str | None = None - The alias of the Dataset (run scope, unnamed storage).
- kwargs: Unpack[GetDataKwargs] - Keyword arguments to be passed to the Dataset.get_data() method.
 - Returns DatasetItemsListPage
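A sketch of reading collected items back from the default Dataset (run inside an async function; `crawler` is assumed to exist):

```python
data = await crawler.get_data()  # returns a DatasetItemsListPage
for item in data.items:
    print(item)
```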
get_dataset
- Return the Dataset with the given ID or name. If none is provided, return the default one.
- Parameters
- optional, keyword-only id: str | None = None
- optional, keyword-only name: str | None = None
- optional, keyword-only alias: str | None = None
 - Returns Dataset
get_key_value_store
- Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.
- Parameters
- optional, keyword-only id: str | None = None
- optional, keyword-only name: str | None = None
- optional, keyword-only alias: str | None = None
 - Returns KeyValueStore
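A sketch of using the default key-value store; the key and value are hypothetical (run inside an async function):

```python
kvs = await crawler.get_key_value_store()
await kvs.set_value('run-note', {'status': 'ok'})  # hypothetical key and value
note = await kvs.get_value('run-note')
```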
get_request_manager
- Return the configured request manager. If none is configured, open and return the default request queue. - Returns RequestManager
on_skipped_request
- Register a function to handle skipped requests.
- The skipped request handler is invoked when a request is skipped due to a collision or other reasons.
- Parameters
- callback: SkippedRequestCallback
 - Returns SkippedRequestCallback
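An illustrative skipped-request callback. The exact parameters are defined by SkippedRequestCallback; a URL plus a reason value (for example a robots.txt exclusion) is assumed here:

```python
@crawler.on_skipped_request
async def log_skipped(url: str, reason: str) -> None:
    # Called when a request is skipped, e.g. because robots.txt disallows it (assumed signature).
    print(f'Skipped {url}: {reason}')
```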
pre_navigation_hook
- Pre-navigation hooks for the adaptive crawler are delegated to the sub crawlers.
- Optionally parametrized decorator. Hooks are wrapped in a context that handles a possibly missing page object by raising AdaptiveContextError.
- Parameters
- optional hook: Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]] | None = None
- optional, keyword-only playwright_only: bool = False
 - Returns Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]
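An illustrative pair of hooks: one shared by both sub crawlers, one restricted to the Playwright sub crawler via playwright_only (the blocked URL pattern is a placeholder):

```python
from crawlee.crawlers import AdaptivePlaywrightPreNavCrawlingContext

@crawler.pre_navigation_hook
async def common_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs for both sub crawlers; touching context.page here raises
    # AdaptiveContextError when the request is handled without a browser.
    context.log.info(f'About to process {context.request.url}')

@crawler.pre_navigation_hook(playwright_only=True)
async def browser_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs only when a Playwright page object is available.
    async def block_images(route) -> None:
        await route.abort()

    await context.page.route('**/*.png', block_images)
```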
run
- Run the crawler until all requests are processed.
- Parameters
- optional requests: Sequence[str | Request] | None = None - The requests to be enqueued before the crawler starts.
- optional, keyword-only purge_request_queue: bool = True - If this is True and the crawler is not being run for the first time, the default request queue will be purged.
 - Returns FinalStatistics
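A short sketch of starting a run; the start URL is a placeholder and the requests_finished field of FinalStatistics is used only as an example of inspecting the result (run inside an async function, with a default handler already registered as in the usage example below):

```python
stats = await crawler.run(
    ['https://crawlee.dev'],    # placeholder start URL(s), enqueued before the run starts
    purge_request_queue=False,  # keep previously enqueued requests on a repeated run
)
print(f'Finished requests: {stats.requests_finished}')
```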
stop
- Set a flag to stop the crawler.
- This stops the current crawler run regardless of whether all requests were finished.
- Parameters
- optional reason: str = 'Stop was called externally.' - Reason for stopping that will be used in logs.
 - Returns None
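An illustrative early stop from inside a request handler; the stopping condition is hypothetical:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawlingContext

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    await context.push_data({'url': context.request.url})
    if context.request.url.endswith('/last-page'):  # hypothetical condition
        crawler.stop(reason='Reached the last page of interest.')
```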
track_browser_request_handler_runs
- Returns None
track_http_only_request_handler_runs
- Returns None
track_rendering_type_mispredictions
- Returns None
with_beautifulsoup_static_parser
- Create an AdaptivePlaywrightCrawler that uses BeautifulSoup for parsing static content.
- Parameters
- optional rendering_type_predictor: RenderingTypePredictor | None = None
- optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
- optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
- optional parser_type: BeautifulSoupParserType = 'lxml'
- optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
- optional statistics: Statistics[StatisticsState] | None = None
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[BeautifulSoup], BeautifulSoup, Tag]
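An illustrative construction with the BeautifulSoup parser; parsed_content is then a BeautifulSoup object in the request handler regardless of how the page was obtained:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    parser_type='lxml',  # the default; shown here only for illustration
)

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    title = context.parsed_content.title  # BeautifulSoup Tag or None
    await context.push_data({'url': context.request.url, 'title': title.text if title else None})
```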
with_parsel_static_parser
- Create an AdaptivePlaywrightCrawler that uses Parsel for parsing static content.
- Parameters
- optional rendering_type_predictor: RenderingTypePredictor | None = None
- optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
- optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
- optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
- optional statistics: Statistics[StatisticsState] | None = None
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[Selector], Selector, Selector]
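The same pattern with the Parsel parser; parsed_content is then a parsel Selector, so CSS and XPath expressions work uniformly across static and browser-rendered pages:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser()

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    title = context.parsed_content.css('title::text').get()
    await context.push_data({'url': context.request.url, 'title': title})
```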
Properties
log
The logger used by the crawler.
router
The Router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
An adaptive web crawler capable of using both static HTTP request based crawling and browser based crawling.
It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit. It uses specific implementations of AbstractHttpCrawler and PlaywrightCrawler.
Usage
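A minimal usage sketch (the start URL is a placeholder; the handler runs the same whether a page was fetched with a plain HTTP request or rendered in the browser):

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)

async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])  # placeholder start URL

if __name__ == '__main__':
    asyncio.run(main())
```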