
AdaptivePlaywrightCrawler

An adaptive web crawler capable of using both static HTTP-request-based crawling and browser-based crawling.

It uses a more limited crawling context interface so that it can switch to HTTP-only crawling whenever it detects that this may bring a performance benefit. Internally, it combines a specific implementation of AbstractHttpCrawler with PlaywrightCrawler.

Usage

from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'browser_type': 'chromium'}
)


@crawler.router.default_handler
async def request_handler_for_label(context: AdaptivePlaywrightCrawlingContext) -> None:
    # Do some processing using `parsed_content`
    context.log.info(context.parsed_content.title)

    # Locate element h2 within 5 seconds
    h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
    # Do stuff with element found by the selector
    context.log.info(h2)

    # Find more links and enqueue them.
    await context.enqueue_links()
    # Save some data.
    await context.push_data({'Visited url': context.request.url})


await crawler.run(['https://crawlee.dev/'])

Hierarchy

Index

Methods

__init__

  • __init__(*, static_parser, rendering_type_predictor, result_checker, result_comparator, playwright_crawler_specific_kwargs, statistics, kwargs): None
  • A default constructor. The recommended way to create an instance is to call one of the factory methods.

    Recommended factory methods: with_beautifulsoup_static_parser, with_parsel_static_parser.


    Parameters

    • keyword-only static_parser: AbstractHttpParser[TStaticParseResult, TStaticSelectResult]

      Implementation of AbstractHttpParser that will be used for static crawling.

    • optional keyword-only rendering_type_predictor: RenderingTypePredictor | None = None

      Object that implements RenderingTypePredictor and is capable of predicting which rendering method should be used. If None, then DefaultRenderingTypePredictor is used.

    • optional keyword-only result_checker: Callable[[RequestHandlerRunResult], bool] | None = None

      Function that evaluates whether a crawling result is valid.

    • optional keyword-only result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None

      Function that compares two crawling results and decides whether they are equivalent.

    • optional keyword-only playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None

      PlaywrightCrawler-only kwargs that are passed to the sub-crawler.

    • optional keyword-only statistics: Statistics[AdaptivePlaywrightCrawlerStatisticState] | None = None

      A custom Statistics[AdaptivePlaywrightCrawlerStatisticState] instance, allowing the use of non-default configuration.

    • keyword-only optional configuration: Configuration

      The Configuration instance. Some of its properties are used as defaults for the crawler.

    • keyword-only optional event_manager: EventManager

      The event manager for managing events for the crawler and all its components.

    • keyword-only optional storage_client: StorageClient

      The storage client for managing storages for the crawler and all its components.

    • keyword-only optional request_manager: RequestManager

      Manager of requests that should be processed by the crawler.

    • keyword-only optional session_pool: SessionPool

      A custom SessionPool instance, allowing the use of non-default configuration.

    • keyword-only optional proxy_configuration: ProxyConfiguration

      HTTP proxy configuration used when making requests.

    • keyword-only optional http_client: HttpClient

      HTTP client used by the BasicCrawlingContext.send_request method.

    • keyword-only optional max_request_retries: int

      Maximum number of attempts to process a single request.

    • keyword-only optional max_requests_per_crawl: int | None

      Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

    • keyword-only optional max_session_rotations: int

      Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

    • keyword-only optional max_crawl_depth: int | None

      Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

    • keyword-only optional use_session_pool: bool

      Enable the use of a session pool for managing sessions during crawling.

    • keyword-only optional retry_on_blocked: bool

      If True, the crawler attempts to bypass bot protections automatically.

    • keyword-only optional concurrency_settings: ConcurrencySettings

      Settings to fine-tune concurrency levels.

    • keyword-only optional request_handler_timeout: timedelta

      Maximum duration allowed for a single request handler to run.

    • keyword-only optional abort_on_error: bool

      If True, the crawler stops immediately when any request handler error occurs.

    • keyword-only optional configure_logging: bool

      If True, the crawler will set up logging infrastructure automatically.

    • keyword-only optional keep_alive: bool

      Flag that can keep the crawler running even when there are no requests in the queue.

    • keyword-only optional additional_http_error_status_codes: Iterable[int]

      Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

    • keyword-only optional ignore_http_error_status_codes: Iterable[int]

      HTTP status codes that are typically considered errors but should be treated as successful responses.

    Returns None
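
    The result_checker and result_comparator parameters control when the crawler trusts an HTTP-only result. A minimal sketch of wiring them through a factory method, reusing the import from the Usage section above; the placeholder predicates are illustrative only and would normally inspect the RequestHandlerRunResult objects they receive:

    def accept_all_results(result) -> bool:
        # Placeholder: treat every static-crawling result as valid.
        return True


    def results_equivalent(first_result, second_result) -> bool:
        # Placeholder: treat the static and browser results as equivalent.
        return True


    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        result_checker=accept_all_results,
        result_comparator=results_equivalent,
    )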

add_requests

  • async add_requests(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request manager in batches.


    Parameters

    • requests: Sequence[str | Request]

      A list of requests to add to the queue.

    • optional keyword-only batch_size: int = 1000

      The number of requests to add in one batch.

    • optional keyword-only wait_time_between_batches: timedelta = timedelta(0)

      Time to wait between adding batches.

    • optional keyword-only wait_for_all_requests_to_be_added: bool = False

      If True, wait for all requests to be added before returning.

    • optional keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

      Timeout for waiting for all requests to be added.

    Returns None
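
    A hedged sketch of seeding the crawler with a larger URL list in batches; the URLs and batch size below are arbitrary examples:

    from datetime import timedelta

    # Hypothetical seed URLs, added in batches of 500 with a short pause in between.
    await crawler.add_requests(
        [f'https://crawlee.dev/page/{i}' for i in range(2_000)],
        batch_size=500,
        wait_time_between_batches=timedelta(seconds=1),
        wait_for_all_requests_to_be_added=True,
    )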

error_handler

export_data

  • async export_data(path, dataset_id, dataset_name): None
  • Export data from a Dataset.

    This helper method simplifies the process of exporting data from a Dataset. It opens the specified dataset and then exports its data based on the provided parameters. If you need to pass options specific to the output format, use the export_data_csv or export_data_json method instead.


    Parameters

    • path: str | Path

      The destination path.

    • optional dataset_id: str | None = None

      The ID of the Dataset.

    • optional dataset_name: str | None = None

      The name of the Dataset.

    Returns None
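
    For illustration, exporting the default dataset after a run; the destination file name is arbitrary and the output format is assumed to follow the file extension:

    await crawler.run(['https://crawlee.dev/'])
    # Write everything collected via push_data to a JSON file.
    await crawler.export_data('results.json')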

export_data_csv

  • async export_data_csv(path, *, dataset_id, dataset_name, kwargs): None
  • Export data from a Dataset to a CSV file.

    This helper method simplifies the process of exporting data from a Dataset in CSV format. It opens the specified dataset and then exports its data based on the provided parameters.


    Parameters

    • path: str | Path

      The destination path.

    • optional keyword-only dataset_id: str | None = None

      The ID of the Dataset.

    • optional keyword-only dataset_name: str | None = None

      The name of the Dataset.

    • keyword-only optional dialect: str

      Specifies a dialect to be used in CSV parsing and writing.

    • keyword-only optional delimiter: str

      A one-character string used to separate fields. Defaults to ','.

    • keyword-only optional doublequote: bool

      Controls how instances of quotechar inside a field should be quoted. When True, the character is doubled; when False, the escapechar is used as a prefix. Defaults to True.

    • keyword-only optional escapechar: str

      A one-character string used to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. Defaults to None, disabling escaping.

    • keyword-only optional lineterminator: str

      The string used to terminate lines produced by the writer. Defaults to '\r\n'.

    • keyword-only optional quotechar: str

      A one-character string used to quote fields containing special characters, like the delimiter or quotechar, or fields containing new-line characters. Defaults to '"'.

    • keyword-only optional quoting: int

      Controls when quotes should be generated by the writer and recognized by the reader. Can take any of the QUOTE_* constants, with a default of QUOTE_MINIMAL.

    • keyword-only optional skipinitialspace: bool

      When True, spaces immediately following the delimiter are ignored. Defaults to False.

    • keyword-only optional strict: bool

      When True, raises an exception on bad CSV input. Defaults to False.

    Returns None
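
    A short, hedged example of a CSV export with a non-default delimiter; the file name is arbitrary:

    # Export the default dataset as semicolon-separated values.
    await crawler.export_data_csv('results.csv', delimiter=';')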

export_data_json

  • async export_data_json(path, *, dataset_id, dataset_name, kwargs): None
  • Export data from a Dataset to a JSON file.

    This helper method simplifies the process of exporting data from a Dataset in JSON format. It opens the specified dataset and then exports its data based on the provided parameters.


    Parameters

    • path: str | Path

      The destination path.

    • optional keyword-only dataset_id: str | None = None

      The ID of the Dataset.

    • optional keyword-only dataset_name: str | None = None

      The name of the Dataset.

    • keyword-only optional skipkeys: bool

      If True (default: False), dict keys that are not of a basic type (str, int, float, bool, None) will be skipped instead of raising a TypeError.

    • keyword-only optional ensure_ascii: bool

      Determines if non-ASCII characters should be escaped in the output JSON string.

    • keyword-only optional check_circular: bool

      If False (default: True), skips the circular reference check for container types. A circular reference will result in a RecursionError or worse if unchecked.

    • keyword-only optional allow_nan: bool

      If False (default: True), raises a ValueError for out-of-range float values (nan, inf, -inf) to strictly comply with the JSON specification. If True, uses their JavaScript equivalents (NaN, Infinity, -Infinity).

    • keyword-only optional cls: type[json.JSONEncoder]

      Allows specifying a custom JSON encoder.

    • keyword-only optional indent: int

      Specifies the number of spaces for indentation in the pretty-printed JSON output.

    • keyword-only optional separators: tuple[str, str]

      A tuple of (item_separator, key_separator). The default is (', ', ': ') if indent is None and (',', ': ') otherwise.

    • keyword-only optional default: Callable

      A function called for objects that can't be serialized otherwise. It should return a JSON-encodable version of the object or raise a TypeError.

    • keyword-only optional sort_keys: bool

      Specifies whether the output JSON object should have keys sorted alphabetically.

    Returns None
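
    A short, hedged example of a pretty-printed JSON export; the file name is arbitrary:

    # Export the default dataset with 2-space indentation and unescaped non-ASCII characters.
    await crawler.export_data_json('results.json', indent=2, ensure_ascii=False)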

failed_request_handler

get_data

  • Retrieve data from a Dataset.

    This helper method simplifies the process of retrieving data from a Dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.


    Parameters

    • optional dataset_id: str | None = None

      The ID of the Dataset.

    • optional dataset_name: str | None = None

      The name of the Dataset.

    • keyword-only optional offset: int

      Skips the specified number of items at the start.

    • keyword-only optional limit: int

      The maximum number of items to retrieve. Unlimited if None.

    • keyword-only optional clean: bool

      Returns only non-empty items and excludes hidden fields. Shortcut for skip_hidden and skip_empty.

    • keyword-only optional desc: bool

      Set to True to sort results in descending order.

    • keyword-only optional fields: list[str]

      Fields to include in each item. Sorts fields as specified if provided.

    • keyword-only optional omit: list[str]

      Fields to exclude from each item.

    • keyword-only optional unwind: str

      Unwinds items by a specified array field, turning each element into a separate item.

    • keyword-only optional skip_empty: bool

      Excludes empty items from the results if True.

    • keyword-only optional skip_hidden: bool

      Excludes fields starting with '#' if True.

    • keyword-only optional flatten: list[str]

      Fields to be flattened in returned items.

    • keyword-only optional view: str

      Specifies the dataset view to be used.

    Returns DatasetItemsListPage
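
    A hedged sketch of retrieving a slice of the dataset in code instead of exporting it; the field name matches the Usage example above:

    # Fetch the ten most recently added items, keeping only the 'Visited url' field.
    page = await crawler.get_data(limit=10, desc=True, fields=['Visited url'])
    for item in page.items:
        crawler.log.info(item)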

get_dataset

  • async get_dataset(*, id, name): Dataset
  • Return the Dataset with the given ID or name. If none is provided, return the default one.


    Parameters

    • optional keyword-only id: str | None = None
    • optional keyword-only name: str | None = None

    Returns Dataset
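
    For illustration, opening a named dataset and writing to it directly; the dataset name is an arbitrary example:

    dataset = await crawler.get_dataset(name='my-results')
    await dataset.push_data({'example_key': 'example_value'})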

get_key_value_store

  • Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.


    Parameters

    • optional keyword-only id: str | None = None
    • optional keyword-only name: str | None = None

    Returns KeyValueStore
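
    A hedged sketch of using the default key-value store for auxiliary data; the key and value are arbitrary examples:

    kvs = await crawler.get_key_value_store()
    await kvs.set_value('crawl-metadata', {'note': 'example value'})
    metadata = await kvs.get_value('crawl-metadata')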

get_request_manager

pre_navigation_hook

  • pre_navigation_hook(hook, *, playwright_only): Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]
  • Pre-navigation hooks for the adaptive crawler are delegated to the sub-crawlers.

    Optionally parametrized decorator. Hooks are wrapped in a context that handles a possibly missing page object by raising AdaptiveContextError.


    Parameters

    • optional hook: Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]] | None = None
    • optional keyword-only playwright_only: bool = False

    Returns Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]
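
    A hedged sketch of registering one hook for both sub-crawlers and one Playwright-only hook; it assumes AdaptivePlaywrightPreNavCrawlingContext is importable from crawlee.crawlers, and the viewport size is an arbitrary example:

    from crawlee.crawlers import AdaptivePlaywrightPreNavCrawlingContext


    @crawler.pre_navigation_hook
    async def common_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Runs before navigation in both the HTTP-only and the browser sub-crawler.
        context.log.info(f'Navigating to {context.request.url}')


    @crawler.pre_navigation_hook(playwright_only=True)
    async def browser_only_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        # Runs only when a real Playwright page is available.
        await context.page.set_viewport_size({'width': 1280, 'height': 720})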

run

  • Run the crawler until all requests are processed.


    Parameters

    • optional requests: Sequence[str | Request] | None = None

      The requests to be enqueued before the crawler starts.

    • optional keyword-only purge_request_queue: bool = True

      If this is True and the crawler is not being run for the first time, the default request queue will be purged.

    Returns FinalStatistics
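
    For illustration, capturing the returned FinalStatistics; the requests_finished attribute is assumed to be present on it:

    statistics = await crawler.run(['https://crawlee.dev/'])
    crawler.log.info(f'Finished requests: {statistics.requests_finished}')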

stop

  • stop(reason): None
  • Set a flag to stop the crawler.

    This stops the current crawler run regardless of whether all requests have been finished.


    Parameters

    • optional reason: str = 'Stop was called externally.'

      Reason for stopping that will be used in logs.

    Returns None
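
    A hedged sketch of stopping the run from inside a request handler once a condition is met; the condition itself is an arbitrary example:

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        await context.push_data({'Visited url': context.request.url})
        # Stop the whole crawl as soon as a documentation page is reached.
        if context.request.url.endswith('/docs'):
            crawler.stop(reason='Reached the documentation page.')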

track_browser_request_handler_runs

  • track_browser_request_handler_runs(): None
  • Returns None

track_http_only_request_handler_runs

  • track_http_only_request_handler_runs(): None
  • Returns None

track_rendering_type_mispredictions

  • track_rendering_type_mispredictions(): None
  • Returns None

with_beautifulsoup_static_parser

  • with_beautifulsoup_static_parser(rendering_type_predictor, result_checker, result_comparator, parser_type, playwright_crawler_specific_kwargs, statistics, kwargs): AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[BeautifulSoup], BeautifulSoup, Tag]
  • Creates an AdaptivePlaywrightCrawler that uses BeautifulSoup for parsing static content.


    Parameters

    • optional rendering_type_predictor: RenderingTypePredictor | None = None
    • optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
    • optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
    • optional parser_type: BeautifulSoupParserType = 'lxml'
    • optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
    • optional statistics: Statistics[StatisticsState] | None = None
    • keyword-only optional configuration: Configuration

      The Configuration instance. Some of its properties are used as defaults for the crawler.

    • keyword-only optional event_manager: EventManager

      The event manager for managing events for the crawler and all its components.

    • keyword-only optional storage_client: StorageClient

      The storage client for managing storages for the crawler and all its components.

    • keyword-only optional request_manager: RequestManager

      Manager of requests that should be processed by the crawler.

    • keyword-only optional session_pool: SessionPool

      A custom SessionPool instance, allowing the use of non-default configuration.

    • keyword-only optional proxy_configuration: ProxyConfiguration

      HTTP proxy configuration used when making requests.

    • keyword-only optional http_client: HttpClient

      HTTP client used by the BasicCrawlingContext.send_request method.

    • keyword-only optional max_request_retries: int

      Maximum number of attempts to process a single request.

    • keyword-only optional max_requests_per_crawl: int | None

      Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

    • keyword-only optional max_session_rotations: int

      Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

    • keyword-only optional max_crawl_depth: int | None

      Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

    • keyword-only optional use_session_pool: bool

      Enable the use of a session pool for managing sessions during crawling.

    • keyword-only optional retry_on_blocked: bool

      If True, the crawler attempts to bypass bot protections automatically.

    • keyword-only optional concurrency_settings: ConcurrencySettings

      Settings to fine-tune concurrency levels.

    • keyword-only optional request_handler_timeout: timedelta

      Maximum duration allowed for a single request handler to run.

    • keyword-only optional abort_on_error: bool

      If True, the crawler stops immediately when any request handler error occurs.

    • keyword-only optional configure_logging: bool

      If True, the crawler will set up logging infrastructure automatically.

    • keyword-only optional keep_alive: bool

      Flag that can keep the crawler running even when there are no requests in the queue.

    • keyword-only optional additional_http_error_status_codes: Iterable[int]

      Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

    • keyword-only optional ignore_http_error_status_codes: Iterable[int]

      HTTP status codes that are typically considered errors but should be treated as successful responses.

    Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[BeautifulSoup], BeautifulSoup, Tag]

with_parsel_static_parser

  • Creates an AdaptivePlaywrightCrawler that uses Parsel for parsing static content.


    Parameters

    • optional rendering_type_predictor: RenderingTypePredictor | None = None
    • optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
    • optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
    • optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
    • optional statistics: Statistics[StatisticsState] | None = None
    • keyword-only optional configuration: Configuration

      The Configuration instance. Some of its properties are used as defaults for the crawler.

    • keyword-only optional event_manager: EventManager

      The event manager for managing events for the crawler and all its components.

    • keyword-only optional storage_client: StorageClient

      The storage client for managing storages for the crawler and all its components.

    • keyword-only optional request_manager: RequestManager

      Manager of requests that should be processed by the crawler.

    • keyword-only optional session_pool: SessionPool

      A custom SessionPool instance, allowing the use of non-default configuration.

    • keyword-only optional proxy_configuration: ProxyConfiguration

      HTTP proxy configuration used when making requests.

    • keyword-only optional http_client: HttpClient

      HTTP client used by the BasicCrawlingContext.send_request method.

    • keyword-only optional max_request_retries: int

      Maximum number of attempts to process a single request.

    • keyword-only optional max_requests_per_crawl: int | None

      Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

    • keyword-only optional max_session_rotations: int

      Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

    • keyword-only optional max_crawl_depth: int | None

      Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

    • keyword-only optional use_session_pool: bool

      Enable the use of a session pool for managing sessions during crawling.

    • keyword-only optional retry_on_blocked: bool

      If True, the crawler attempts to bypass bot protections automatically.

    • keyword-only optional concurrency_settings: ConcurrencySettings

      Settings to fine-tune concurrency levels.

    • keyword-only optional request_handler_timeout: timedelta

      Maximum duration allowed for a single request handler to run.

    • keyword-only optional abort_on_error: bool

      If True, the crawler stops immediately when any request handler error occurs.

    • keyword-only optional configure_logging: bool

      If True, the crawler will set up logging infrastructure automatically.

    • keyword-only optional keep_alive: bool

      Flag that can keep the crawler running even when there are no requests in the queue.

    • keyword-only optional additional_http_error_status_codes: Iterable[int]

      Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

    • keyword-only optional ignore_http_error_status_codes: Iterable[int]

      HTTP status codes that are typically considered errors but should be treated as successful responses.

    Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[Selector], Selector, Selector]
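
    A short, hedged usage sketch mirroring the BeautifulSoup example above; with the Parsel parser, parsed_content is a parsel Selector, so CSS queries are available:

    from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

    crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(max_requests_per_crawl=5)


    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Extract the page title with a Parsel CSS query.
        title = context.parsed_content.css('title::text').get()
        await context.push_data({'url': context.request.url, 'title': title})


    await crawler.run(['https://crawlee.dev/'])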

Properties

log

log: logging.Logger

The logger used by the crawler.

router

The Router used to handle each individual crawling request.

statistics

Statistics about the current (or last) crawler run.