AdaptivePlaywrightCrawler
Hierarchy
- BasicCrawler
  - AdaptivePlaywrightCrawler
 
Index
Methods
- __init__
- add_requests
- error_handler
- export_data
- failed_request_handler
- get_data
- get_dataset
- get_key_value_store
- get_request_manager
- on_skipped_request
- pre_navigation_hook
- run
- stop
- track_browser_request_handler_runs
- track_http_only_request_handler_runs
- track_rendering_type_mispredictions
- with_beautifulsoup_static_parser
- with_parsel_static_parser
Properties
- log
- router
- statistics
Methods
__init__
- Initialize a new instance. The recommended way to create an instance is through the factory methods with_beautifulsoup_static_parser or with_parsel_static_parser.
- Parameters
- keyword-only static_parser: AbstractHttpParser[TStaticParseResult, TStaticSelectResult] - Implementation of AbstractHttpParser. Parser that will be used for static crawling.
- optional, keyword-only rendering_type_predictor: RenderingTypePredictor | None = None - Object that implements RenderingTypePredictor and is capable of predicting which rendering method should be used. If None, then DefaultRenderingTypePredictor is used.
- optional, keyword-only result_checker: Callable[[RequestHandlerRunResult], bool] | None = None - Function that evaluates whether a crawling result is valid or not.
- optional, keyword-only result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None - Function that compares two crawling results and decides whether they are equivalent.
- optional, keyword-only playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None - PlaywrightCrawler-only kwargs that are passed to the sub crawler.
- optional, keyword-only statistics: Statistics[AdaptivePlaywrightCrawlerStatisticState] | None = None - A custom Statistics[AdaptivePlaywrightCrawlerStatisticState] instance, allowing the use of non-default configuration.
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns None
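Although __init__ can be called directly with a custom static parser, the factory methods are the intended entry points. A minimal construction sketch (the option values and the headless flag are illustrative, not required defaults):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

# Recommended: construct via a factory method rather than calling __init__ directly.
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    max_requests_per_crawl=10,  # keyword-only option described above
    playwright_crawler_specific_kwargs={'headless': True},  # forwarded to the Playwright sub crawler
)
```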
add_requests
- Add requests to the underlying request manager in batches.
- Parameters
- requests: Sequence[str | Request] - A list of requests to add to the queue.
- optional, keyword-only forefront: bool = False - If True, add requests to the forefront of the queue.
- optional, keyword-only batch_size: int = 1000 - The number of requests to add in one batch.
- optional, keyword-only wait_time_between_batches: timedelta = timedelta(0) - Time to wait between adding batches.
- optional, keyword-only wait_for_all_requests_to_be_added: bool = False - If True, wait for all requests to be added before returning.
- optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None - Timeout for waiting for all requests to be added.
 - Returns None
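A short sketch of batched enqueuing, assuming `crawler` was created as shown earlier and that this code runs inside an async function (the URLs are placeholders):

```python
await crawler.add_requests(
    ['https://crawlee.dev', 'https://crawlee.dev/docs'],  # placeholder URLs
    batch_size=500,                          # submit requests in batches of 500
    wait_for_all_requests_to_be_added=True,  # block until every batch has been submitted
)
```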
error_handler
- Register a function to handle errors occurring in request handlers.
- The error handler is invoked after a request handler error occurs and before a retry attempt.
- Parameters
- handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
 - Returns ErrorHandler[TCrawlingContext]
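An illustrative registration of an error handler. The handler receives the crawling context and the raised exception and may return a modified Request to use for the retry; the import paths assume the current crawlee.crawlers layout:

```python
from crawlee import Request
from crawlee.crawlers import AdaptivePlaywrightCrawler, BasicCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()

@crawler.error_handler
async def retry_logger(context: BasicCrawlingContext, error: Exception) -> Request | None:
    # Invoked after a request handler error, before the retry attempt.
    context.log.warning(f'Retrying {context.request.url} after error: {error}')
    return None  # returning None keeps the original request for the retry
```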
export_data
- Export all items from a Dataset to a JSON or CSV file.
- This method simplifies the process of exporting data collected during crawling. It automatically determines the export format based on the file extension (.json or .csv) and handles the conversion of Dataset items to the appropriate format.
- Parameters
- path: str | Path - The destination file path. Must end with '.json' or '.csv'.
- optional dataset_id: str | None = None - The ID of the Dataset to export from.
- optional dataset_name: str | None = None - The name of the Dataset to export from (global scope, named storage).
- optional dataset_alias: str | None = None - The alias of the Dataset to export from (run scope, unnamed storage).
 - Returns None
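A sketch of exporting the default dataset after a run; the format is inferred from the file extension and the file names are placeholders (run inside an async function):

```python
await crawler.export_data('results.json')  # exports Dataset items as JSON
await crawler.export_data('results.csv')   # exports the same items as CSV
```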
failed_request_handler
- Register a function to handle requests that exceed the maximum retry limit.
- The failed request handler is invoked when a request has failed all retry attempts.
- Parameters
- handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
 - Returns FailedRequestHandler[TCrawlingContext]
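An illustrative failed-request handler, registered the same way as the error handler above (the `crawler` instance and import path are assumed as before):

```python
from crawlee.crawlers import BasicCrawlingContext

@crawler.failed_request_handler
async def on_failed(context: BasicCrawlingContext, error: Exception) -> None:
    # Invoked once a request has exhausted all retry attempts.
    context.log.error(f'Giving up on {context.request.url}: {error}')
```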
get_data
- Retrieve data from a Dataset.
- This helper method simplifies the process of retrieving data from a Dataset. It opens the specified one and then retrieves the data based on the provided parameters.
- Parameters
- optional dataset_id: str | None = None - The ID of the Dataset.
- optional dataset_name: str | None = None - The name of the Dataset (global scope, named storage).
- optional dataset_alias: str | None = None - The alias of the Dataset (run scope, unnamed storage).
- kwargs: Unpack[GetDataKwargs] - Keyword arguments to be passed to the Dataset.get_data() method.
 - Returns DatasetItemsListPage
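A sketch of reading collected items back from the default Dataset (run inside an async function; `crawler` is assumed to exist):

```python
data = await crawler.get_data()  # returns a DatasetItemsListPage
for item in data.items:
    print(item)
```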
get_dataset
- Return the Dataset with the given ID or name. If none is provided, return the default one.
- Parameters
- optional, keyword-only id: str | None = None
- optional, keyword-only name: str | None = None
- optional, keyword-only alias: str | None = None
 - Returns Dataset
get_key_value_store
- Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.
- Parameters
- optional, keyword-only id: str | None = None
- optional, keyword-only name: str | None = None
- optional, keyword-only alias: str | None = None
 - Returns KeyValueStore
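A sketch of using the default key-value store; the key and value are hypothetical (run inside an async function):

```python
kvs = await crawler.get_key_value_store()
await kvs.set_value('run-note', {'status': 'ok'})  # hypothetical key and value
note = await kvs.get_value('run-note')
```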
get_request_manager
- Return the configured request manager. If none is configured, open and return the default request queue. - Returns RequestManager
on_skipped_request
- Register a function to handle skipped requests.
- The skipped request handler is invoked when a request is skipped due to a collision or other reasons.
- Parameters
- callback: SkippedRequestCallback
 - Returns SkippedRequestCallback
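An illustrative skipped-request callback. The exact parameters are defined by SkippedRequestCallback; a URL plus a reason value (for example a robots.txt exclusion) is assumed here:

```python
@crawler.on_skipped_request
async def log_skipped(url: str, reason: str) -> None:
    # Called when a request is skipped, e.g. because robots.txt disallows it (assumed signature).
    print(f'Skipped {url}: {reason}')
```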
pre_navigation_hook
- Pre-navigation hooks for the adaptive crawler are delegated to the sub crawlers.
- Optionally parametrized decorator. Hooks are wrapped in a context that handles a possibly missing page object by raising AdaptiveContextError.
- Parameters
- optional hook: Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]] | None = None
- optional, keyword-only playwright_only: bool = False
 - Returns Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]
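An illustrative pair of hooks: one shared by both sub crawlers, one restricted to the Playwright sub crawler via playwright_only (the blocked URL pattern is a placeholder):

```python
from crawlee.crawlers import AdaptivePlaywrightPreNavCrawlingContext

@crawler.pre_navigation_hook
async def common_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs for both sub crawlers; touching context.page here raises
    # AdaptiveContextError when the request is handled without a browser.
    context.log.info(f'About to process {context.request.url}')

@crawler.pre_navigation_hook(playwright_only=True)
async def browser_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs only when a Playwright page object is available.
    async def block_images(route) -> None:
        await route.abort()

    await context.page.route('**/*.png', block_images)
```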
run
- Run the crawler until all requests are processed.
- Parameters
- optional requests: Sequence[str | Request] | None = None - The requests to be enqueued before the crawler starts.
- optional, keyword-only purge_request_queue: bool = True - If this is True and the crawler is not being run for the first time, the default request queue will be purged.
 - Returns FinalStatistics
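A short sketch of starting a run; the start URL is a placeholder and the requests_finished field of FinalStatistics is used only as an example of inspecting the result (run inside an async function, with a default handler already registered as in the usage example below):

```python
stats = await crawler.run(
    ['https://crawlee.dev'],    # placeholder start URL(s), enqueued before the run starts
    purge_request_queue=False,  # keep previously enqueued requests on a repeated run
)
print(f'Finished requests: {stats.requests_finished}')
```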
stop
- Set a flag to stop the crawler.
- This stops the current crawler run regardless of whether all requests were finished.
- Parameters
- optional reason: str = 'Stop was called externally.' - Reason for stopping that will be used in logs.
 - Returns None
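An illustrative early stop from inside a request handler; the stopping condition is hypothetical:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawlingContext

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    await context.push_data({'url': context.request.url})
    if context.request.url.endswith('/last-page'):  # hypothetical condition
        crawler.stop(reason='Reached the last page of interest.')
```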
track_browser_request_handler_runs
- Returns None
track_http_only_request_handler_runs
- Returns None
track_rendering_type_mispredictions
- Returns None
with_beautifulsoup_static_parser
- Create an AdaptivePlaywrightCrawler that uses BeautifulSoup for parsing static content.
- Parameters
- optional rendering_type_predictor: RenderingTypePredictor | None = None
- optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
- optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
- optional parser_type: BeautifulSoupParserType = 'lxml'
- optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
- optional statistics: Statistics[StatisticsState] | None = None
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[BeautifulSoup], BeautifulSoup, Tag]
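An illustrative construction with the BeautifulSoup parser; parsed_content is then a BeautifulSoup object in the request handler regardless of how the page was obtained:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    parser_type='lxml',  # the default; shown here only for illustration
)

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    title = context.parsed_content.title  # BeautifulSoup Tag or None
    await context.push_data({'url': context.request.url, 'title': title.text if title else None})
```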
with_parsel_static_parser
- Create an AdaptivePlaywrightCrawler that uses Parsel for parsing static content.
- Parameters
- optional rendering_type_predictor: RenderingTypePredictor | None = None
- optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
- optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
- optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
- optional statistics: Statistics[StatisticsState] | None = None
- keyword-only, optional configuration: Configuration - The Configuration instance. Some of its properties are used as defaults for the crawler.
- keyword-only, optional event_manager: EventManager - The event manager for managing events for the crawler and all its components.
- keyword-only, optional storage_client: StorageClient - The storage client for managing storages for the crawler and all its components.
- keyword-only, optional request_manager: RequestManager - Manager of requests that should be processed by the crawler.
- keyword-only, optional session_pool: SessionPool - A custom SessionPool instance, allowing the use of non-default configuration.
- keyword-only, optional proxy_configuration: ProxyConfiguration - HTTP proxy configuration used when making requests.
- keyword-only, optional http_client: HttpClient - HTTP client used by the BasicCrawlingContext.send_request method.
- keyword-only, optional max_request_retries: int - Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
- keyword-only, optional max_requests_per_crawl: int | None - Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
- keyword-only, optional max_session_rotations: int - Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. The session rotations are not counted towards the max_request_retries limit.
- keyword-only, optional max_crawl_depth: int | None - Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
- keyword-only, optional use_session_pool: bool - Enable the use of a session pool for managing sessions during crawling.
- keyword-only, optional retry_on_blocked: bool - If True, the crawler attempts to bypass bot protections automatically.
- keyword-only, optional concurrency_settings: ConcurrencySettings - Settings to fine-tune concurrency levels.
- keyword-only, optional request_handler_timeout: timedelta - Maximum duration allowed for a single request handler to run.
- keyword-only, optional abort_on_error: bool - If True, the crawler stops immediately when any request handler error occurs.
- keyword-only, optional configure_logging: bool - If True, the crawler will set up logging infrastructure automatically.
- keyword-only, optional statistics_log_format: Literal['table', 'inline'] - If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
- keyword-only, optional keep_alive: bool - Flag that can keep the crawler running even when there are no requests in the queue.
- keyword-only, optional additional_http_error_status_codes: Iterable[int] - Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
- keyword-only, optional ignore_http_error_status_codes: Iterable[int] - HTTP status codes that are typically considered errors but should be treated as successful responses.
- keyword-only, optional respect_robots_txt_file: bool - If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
- keyword-only, optional status_message_logging_interval: timedelta - Interval for logging the crawler status messages.
- keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] - Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
 - Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[Selector], Selector, Selector]
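The same pattern with the Parsel parser; parsed_content is then a parsel Selector, so CSS and XPath expressions work uniformly across static and browser-rendered pages:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser()

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    title = context.parsed_content.css('title::text').get()
    await context.push_data({'url': context.request.url, 'title': title})
```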
Properties
log
The logger used by the crawler.
router
The Router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
An adaptive web crawler capable of using both static HTTP request based crawling and browser based crawling.
It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that it may bring a performance benefit. It uses specific implementations of AbstractHttpCrawler and PlaywrightCrawler.
Usage
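A minimal usage sketch (the start URL is a placeholder; the handler runs the same whether a page was fetched with a plain HTTP request or rendered in the browser):

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)

async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])  # placeholder start URL

if __name__ == '__main__':
    asyncio.run(main())
```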