AdaptivePlaywrightCrawler
Hierarchy
- BasicCrawler
- AdaptivePlaywrightCrawler
Index
Methods
- __init__
- add_requests
- error_handler
- export_data
- export_data_csv
- export_data_json
- failed_request_handler
- get_data
- get_dataset
- get_key_value_store
- get_request_manager
- pre_navigation_hook
- run
- stop
- track_browser_request_handler_runs
- track_http_only_request_handler_runs
- track_rendering_type_mispredictions
- with_beautifulsoup_static_parser
- with_parsel_static_parser
Properties
- log
- router
- statistics
Methods
__init__
A default constructor. The recommended way to create an instance is to call one of the factory methods: with_beautifulsoup_static_parser or with_parsel_static_parser.
Parameters
keyword-only static_parser: AbstractHttpParser[TStaticParseResult, TStaticSelectResult]
Implementation of AbstractHttpParser. The parser that will be used for static crawling.
optional, keyword-only rendering_type_predictor: RenderingTypePredictor | None = None
Object that implements RenderingTypePredictor and is capable of predicting which rendering method should be used. If None, then DefaultRenderingTypePredictor is used.
optional, keyword-only result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
Function that evaluates whether a crawling result is valid or not.
optional, keyword-only result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
Function that compares two crawling results and decides whether they are equivalent.
optional, keyword-only playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
PlaywrightCrawler-specific kwargs that are passed to the sub-crawler.
optional, keyword-only statistics: Statistics[AdaptivePlaywrightCrawlerStatisticState] | None = None
A custom Statistics[AdaptivePlaywrightCrawlerStatisticState] instance, allowing the use of non-default configuration.
optional, keyword-only configuration: Configuration
The Configuration instance. Some of its properties are used as defaults for the crawler.
optional, keyword-only event_manager: EventManager
The event manager for managing events for the crawler and all its components.
optional, keyword-only storage_client: StorageClient
The storage client for managing storages for the crawler and all its components.
optional, keyword-only request_manager: RequestManager
Manager of requests that should be processed by the crawler.
optional, keyword-only session_pool: SessionPool
A custom SessionPool instance, allowing the use of non-default configuration.
optional, keyword-only proxy_configuration: ProxyConfiguration
HTTP proxy configuration used when making requests.
optional, keyword-only http_client: HttpClient
HTTP client used by the BasicCrawlingContext.send_request method.
optional, keyword-only max_request_retries: int
Maximum number of attempts to process a single request.
optional, keyword-only max_requests_per_crawl: int | None
Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
optional, keyword-only max_session_rotations: int
Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.
optional, keyword-only max_crawl_depth: int | None
Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
optional, keyword-only use_session_pool: bool
Enable the use of a session pool for managing sessions during crawling.
optional, keyword-only retry_on_blocked: bool
If True, the crawler attempts to bypass bot protections automatically.
optional, keyword-only concurrency_settings: ConcurrencySettings
Settings to fine-tune concurrency levels.
optional, keyword-only request_handler_timeout: timedelta
Maximum duration allowed for a single request handler to run.
optional, keyword-only abort_on_error: bool
If True, the crawler stops immediately when any request handler error occurs.
optional, keyword-only configure_logging: bool
If True, the crawler will set up logging infrastructure automatically.
optional, keyword-only keep_alive: bool
If True, keeps the crawler running even when there are no requests in the queue.
optional, keyword-only additional_http_error_status_codes: Iterable[int]
Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
optional, keyword-only ignore_http_error_status_codes: Iterable[int]
HTTP status codes that are typically considered errors but should be treated as successful responses.
Returns None
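For example, a minimal construction sketch using a factory method, as recommended above (import path and limit value are assumptions based on recent crawlee versions):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

# Prefer the factory methods over calling __init__ directly; they wire up
# the static parser and both sub-crawlers for you.
crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    max_requests_per_crawl=50,  # illustrative limit
)
```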
add_requests
Add requests to the underlying request manager in batches.
Parameters
requests: Sequence[str | Request]
A list of requests to add to the queue.
optional, keyword-only batch_size: int = 1000
The number of requests to add in one batch.
optional, keyword-only wait_time_between_batches: timedelta = timedelta(0)
Time to wait between adding batches.
optional, keyword-only wait_for_all_requests_to_be_added: bool = False
If True, wait for all requests to be added before returning.
optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Timeout for waiting for all requests to be added.
Returns None
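For example, a minimal sketch, assuming an existing crawler instance inside an async context (URLs and batch size are placeholders):

```python
await crawler.add_requests(
    ['https://example.com/a', 'https://example.com/b'],  # placeholder URLs
    batch_size=500,
    wait_for_all_requests_to_be_added=True,  # block until everything is queued
)
```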
error_handler
Decorator for configuring an error handler (called after a request handler error and before retrying).
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
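A sketch of registering an error handler (the handler body and import path are assumptions):

```python
from crawlee.crawlers import BasicCrawlingContext

@crawler.error_handler
async def retry_logger(context: BasicCrawlingContext, error: Exception) -> None:
    # Called after a request handler error, before the request is retried.
    context.log.warning(f'Retrying {context.request.url} after error: {error}')
```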
export_data
Export data from a Dataset.
This helper method simplifies the process of exporting data from a Dataset. It opens the specified one and then exports the data based on the provided parameters. If you need to pass options specific to the output format, use the export_data_csv or export_data_json method instead.
Parameters
path: str | Path
The destination path.
optional dataset_id: str | None = None
The ID of the Dataset.
optional dataset_name: str | None = None
The name of the Dataset.
Returns None
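For example, a minimal sketch, assuming the output format is inferred from the file extension (the path is a placeholder):

```python
# Export the default dataset to a JSON file.
await crawler.export_data('results.json')
```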
export_data_csv
Export data from a Dataset to a CSV file.
This helper method simplifies the process of exporting data from a Dataset in CSV format. It opens the specified one and then exports the data based on the provided parameters.
Parameters
path: str | Path
The destination path.
optional, keyword-only dataset_id: str | None = None
The ID of the Dataset.
optional, keyword-only dataset_name: str | None = None
The name of the Dataset.
optional, keyword-only dialect: str
Specifies a dialect to be used in CSV parsing and writing.
optional, keyword-only delimiter: str
A one-character string used to separate fields. Defaults to ','.
optional, keyword-only doublequote: bool
Controls how instances of quotechar inside a field should be quoted. When True, the character is doubled; when False, the escapechar is used as a prefix. Defaults to True.
optional, keyword-only escapechar: str
A one-character string used to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. Defaults to None, disabling escaping.
optional, keyword-only lineterminator: str
The string used to terminate lines produced by the writer. Defaults to '\r\n'.
optional, keyword-only quotechar: str
A one-character string used to quote fields containing special characters, like the delimiter or quotechar, or fields containing new-line characters. Defaults to '"'.
optional, keyword-only quoting: int
Controls when quotes should be generated by the writer and recognized by the reader. Can take any of the QUOTE_* constants, with a default of QUOTE_MINIMAL.
optional, keyword-only skipinitialspace: bool
When True, spaces immediately following the delimiter are ignored. Defaults to False.
optional, keyword-only strict: bool
When True, raises an exception on bad CSV input. Defaults to False.
Returns None
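A sketch of a customized CSV export (path and dataset name are hypothetical):

```python
# Semicolon-delimited CSV export from a named dataset.
await crawler.export_data_csv(
    'results.csv',
    dataset_name='products',  # hypothetical dataset name
    delimiter=';',
)
```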
export_data_json
Export data from a Dataset to a JSON file.
This helper method simplifies the process of exporting data from a Dataset in JSON format. It opens the specified one and then exports the data based on the provided parameters.
Parameters
path: str | Path
The destination path.
optional, keyword-only dataset_id: str | None = None
The ID of the Dataset.
optional, keyword-only dataset_name: str | None = None
The name of the Dataset.
optional, keyword-only skipkeys: bool
If True (default: False), dict keys that are not of a basic type (str, int, float, bool, None) will be skipped instead of raising a TypeError.
optional, keyword-only ensure_ascii: bool
Determines if non-ASCII characters should be escaped in the output JSON string.
optional, keyword-only check_circular: bool
If False (default: True), skips the circular reference check for container types. A circular reference will result in a RecursionError or worse if unchecked.
optional, keyword-only allow_nan: bool
If False (default: True), raises a ValueError for out-of-range float values (nan, inf, -inf) to strictly comply with the JSON specification. If True, uses their JavaScript equivalents (NaN, Infinity, -Infinity).
optional, keyword-only cls: type[json.JSONEncoder]
Allows specifying a custom JSON encoder.
optional, keyword-only indent: int
Specifies the number of spaces for indentation in the pretty-printed JSON output.
optional, keyword-only separators: tuple[str, str]
A tuple of (item_separator, key_separator). The default is (', ', ': ') if indent is None and (',', ': ') otherwise.
optional, keyword-only default: Callable
A function called for objects that can't be serialized otherwise. It should return a JSON-encodable version of the object or raise a TypeError.
optional, keyword-only sort_keys: bool
Specifies whether the output JSON object should have keys sorted alphabetically.
Returns None
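A sketch of a customized JSON export (the path is a placeholder):

```python
# Pretty-printed JSON export that keeps non-ASCII characters unescaped.
await crawler.export_data_json(
    'results.json',
    indent=2,
    ensure_ascii=False,
)
```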
failed_request_handler
Decorator for configuring a failed request handler (called after max retries are reached).
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
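A sketch of registering a failed request handler (import path as assumed above):

```python
from crawlee.crawlers import BasicCrawlingContext

@crawler.failed_request_handler
async def failure_logger(context: BasicCrawlingContext, error: Exception) -> None:
    # Called once per request, after all retries have been exhausted.
    context.log.error(f'Giving up on {context.request.url}: {error}')
```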
get_data
Retrieve data from a Dataset.
This helper method simplifies the process of retrieving data from a Dataset. It opens the specified one and then retrieves the data based on the provided parameters.
Parameters
optional dataset_id: str | None = None
The ID of the Dataset.
optional dataset_name: str | None = None
The name of the Dataset.
optional, keyword-only offset: int
Skips the specified number of items at the start.
optional, keyword-only limit: int
The maximum number of items to retrieve. Unlimited if None.
optional, keyword-only clean: bool
Returns only non-empty items and excludes hidden fields. Shortcut for skip_hidden and skip_empty.
optional, keyword-only desc: bool
Set to True to sort results in descending order.
optional, keyword-only fields: list[str]
Fields to include in each item. Sorts fields as specified if provided.
optional, keyword-only omit: list[str]
Fields to exclude from each item.
optional, keyword-only unwind: str
Unwinds items by a specified array field, turning each element into a separate item.
optional, keyword-only skip_empty: bool
Excludes empty items from the results if True.
optional, keyword-only skip_hidden: bool
Excludes fields starting with '#' if True.
optional, keyword-only flatten: list[str]
Fields to be flattened in returned items.
optional, keyword-only view: str
Specifies the dataset view to be used.
Returns DatasetItemsListPage
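For example, a minimal sketch of retrieving items from the default dataset:

```python
# Fetch up to 50 of the newest non-empty items.
page = await crawler.get_data(limit=50, desc=True, clean=True)
for item in page.items:  # the DatasetItemsListPage carries the items
    print(item)
```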
get_dataset
Return the Dataset with the given ID or name. If none is provided, return the default one.
Parameters
optional, keyword-only id: str | None = None
optional, keyword-only name: str | None = None
Returns Dataset
get_key_value_store
Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.
Parameters
optional, keyword-only id: str | None = None
optional, keyword-only name: str | None = None
Returns KeyValueStore
get_request_manager
Return the configured request manager. If none is configured, open and return the default request queue.
Returns RequestManager
pre_navigation_hook
Pre-navigation hooks for the adaptive crawler are delegated to the sub-crawlers.
Optionally parametrized decorator. Hooks are wrapped in a context that handles a possibly missing page object by raising AdaptiveContextError.
Parameters
optional hook: Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]] | None = None
optional, keyword-only playwright_only: bool = False
Returns Callable[[Callable[[AdaptivePlaywrightPreNavCrawlingContext], Awaitable[None]]], None]
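A sketch of both hook variants (import path and hook bodies are assumptions; the viewport call is standard Playwright):

```python
from crawlee.crawlers import AdaptivePlaywrightPreNavCrawlingContext

@crawler.pre_navigation_hook
async def common_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs for both HTTP-only and browser-based requests; touching
    # context.page here would raise AdaptiveContextError on HTTP-only runs.
    context.log.info(f'Navigating to {context.request.url}')

@crawler.pre_navigation_hook(playwright_only=True)
async def browser_hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
    # Runs only when a real browser page exists, so context.page is safe.
    await context.page.set_viewport_size({'width': 1280, 'height': 720})
```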
run
Run the crawler until all requests are processed.
Parameters
optional requests: Sequence[str | Request] | None = None
The requests to be enqueued before the crawler starts.
optional, keyword-only purge_request_queue: bool = True
If this is True and the crawler is not being run for the first time, the default request queue will be purged.
Returns FinalStatistics
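For example (the start URL is a placeholder):

```python
# Seed the crawler and run it to completion, keeping any queue left over
# from a previous run instead of purging it.
stats = await crawler.run(
    ['https://crawlee.dev/'],
    purge_request_queue=False,
)
print(stats)  # FinalStatistics summary
```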
stop
Set a flag to stop the crawler.
This stops the current crawler run regardless of whether all requests have been processed.
Parameters
optional reason: str = 'Stop was called externally.'
The reason for stopping, which will be used in logs.
Returns None
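A sketch of stopping from inside a request handler (the stop condition is hypothetical):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawlingContext

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    if 'target' in context.request.url:  # hypothetical stop condition
        crawler.stop(reason='Target page found.')
```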
track_browser_request_handler_runs
Returns None
track_http_only_request_handler_runs
Returns None
track_rendering_type_mispredictions
Returns None
with_beautifulsoup_static_parser
Creates an AdaptivePlaywrightCrawler that uses BeautifulSoup for parsing static content.
Parameters
optional rendering_type_predictor: RenderingTypePredictor | None = None
optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
optional parser_type: BeautifulSoupParserType = 'lxml'
optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
optional statistics: Statistics[StatisticsState] | None = None
optional, keyword-only configuration: Configuration
The Configuration instance. Some of its properties are used as defaults for the crawler.
optional, keyword-only event_manager: EventManager
The event manager for managing events for the crawler and all its components.
optional, keyword-only storage_client: StorageClient
The storage client for managing storages for the crawler and all its components.
optional, keyword-only request_manager: RequestManager
Manager of requests that should be processed by the crawler.
optional, keyword-only session_pool: SessionPool
A custom SessionPool instance, allowing the use of non-default configuration.
optional, keyword-only proxy_configuration: ProxyConfiguration
HTTP proxy configuration used when making requests.
optional, keyword-only http_client: HttpClient
HTTP client used by the BasicCrawlingContext.send_request method.
optional, keyword-only max_request_retries: int
Maximum number of attempts to process a single request.
optional, keyword-only max_requests_per_crawl: int | None
Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
optional, keyword-only max_session_rotations: int
Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.
optional, keyword-only max_crawl_depth: int | None
Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
optional, keyword-only use_session_pool: bool
Enable the use of a session pool for managing sessions during crawling.
optional, keyword-only retry_on_blocked: bool
If True, the crawler attempts to bypass bot protections automatically.
optional, keyword-only concurrency_settings: ConcurrencySettings
Settings to fine-tune concurrency levels.
optional, keyword-only request_handler_timeout: timedelta
Maximum duration allowed for a single request handler to run.
optional, keyword-only abort_on_error: bool
If True, the crawler stops immediately when any request handler error occurs.
optional, keyword-only configure_logging: bool
If True, the crawler will set up logging infrastructure automatically.
optional, keyword-only keep_alive: bool
If True, keeps the crawler running even when there are no requests in the queue.
optional, keyword-only additional_http_error_status_codes: Iterable[int]
Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
optional, keyword-only ignore_http_error_status_codes: Iterable[int]
HTTP status codes that are typically considered errors but should be treated as successful responses.
Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[BeautifulSoup], BeautifulSoup, Tag]
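For example, a minimal sketch (the parser choice simply restates the default):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

# BeautifulSoup backend with an explicit parser; 'lxml' is the default.
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    parser_type='lxml',
)
```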
with_parsel_static_parser
Creates an AdaptivePlaywrightCrawler that uses Parsel for parsing static content.
Parameters
optional rendering_type_predictor: RenderingTypePredictor | None = None
optional result_checker: Callable[[RequestHandlerRunResult], bool] | None = None
optional result_comparator: Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool] | None = None
optional playwright_crawler_specific_kwargs: _PlaywrightCrawlerAdditionalOptions | None = None
optional statistics: Statistics[StatisticsState] | None = None
optional, keyword-only configuration: Configuration
The Configuration instance. Some of its properties are used as defaults for the crawler.
optional, keyword-only event_manager: EventManager
The event manager for managing events for the crawler and all its components.
optional, keyword-only storage_client: StorageClient
The storage client for managing storages for the crawler and all its components.
optional, keyword-only request_manager: RequestManager
Manager of requests that should be processed by the crawler.
optional, keyword-only session_pool: SessionPool
A custom SessionPool instance, allowing the use of non-default configuration.
optional, keyword-only proxy_configuration: ProxyConfiguration
HTTP proxy configuration used when making requests.
optional, keyword-only http_client: HttpClient
HTTP client used by the BasicCrawlingContext.send_request method.
optional, keyword-only max_request_retries: int
Maximum number of attempts to process a single request.
optional, keyword-only max_requests_per_crawl: int | None
Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
optional, keyword-only max_session_rotations: int
Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.
optional, keyword-only max_crawl_depth: int | None
Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
optional, keyword-only use_session_pool: bool
Enable the use of a session pool for managing sessions during crawling.
optional, keyword-only retry_on_blocked: bool
If True, the crawler attempts to bypass bot protections automatically.
optional, keyword-only concurrency_settings: ConcurrencySettings
Settings to fine-tune concurrency levels.
optional, keyword-only request_handler_timeout: timedelta
Maximum duration allowed for a single request handler to run.
optional, keyword-only abort_on_error: bool
If True, the crawler stops immediately when any request handler error occurs.
optional, keyword-only configure_logging: bool
If True, the crawler will set up logging infrastructure automatically.
optional, keyword-only keep_alive: bool
If True, keeps the crawler running even when there are no requests in the queue.
optional, keyword-only additional_http_error_status_codes: Iterable[int]
Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
optional, keyword-only ignore_http_error_status_codes: Iterable[int]
HTTP status codes that are typically considered errors but should be treated as successful responses.
Returns AdaptivePlaywrightCrawler[ParsedHttpCrawlingContext[Selector], Selector, Selector]
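For example, a minimal sketch (the browser_type key is assumed to be among the options accepted by the Playwright sub-crawler):

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

# Parsel backend; browser options pass through to the Playwright sub-crawler.
crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    playwright_crawler_specific_kwargs={'browser_type': 'firefox'},
)
```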
Properties
log
The logger used by the crawler.
router
The Router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
An adaptive web crawler capable of both static HTTP-request-based crawling and browser-based crawling.
It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that this may bring a performance benefit. It uses specific implementations of AbstractHttpCrawler and PlaywrightCrawler.
Usage
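A minimal end-to-end sketch (start URL and limits are illustrative; import paths as assumed for recent crawlee versions):

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,  # illustrative safety limit
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # The handler works the same whether the page was fetched over plain
        # HTTP or rendered in a browser.
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```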