PlaywrightCrawlerOptions

Arguments for the AbstractHttpCrawler constructor.

It is intended for typing forwarded __init__ arguments in the subclasses.

Hierarchy

_PlaywrightCrawlerAdditionalOptions
BasicCrawlerOptions
- PlaywrightCrawlerOptions

Properties

abort_on_error

abort_on_error: NotRequired[bool]

If True, the crawler stops immediately when any request handler error occurs.

additional_http_error_status_codes

additional_http_error_status_codes: NotRequired[Iterable[int]]

Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

browser_launch_options

browser_launch_options: NotRequired[Mapping[str, Any]]

Keyword arguments to pass to the browser launch method. These options are provided directly to Playwright's browser_type.launch method. For more details, refer to the Playwright documentation: https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch. This option should not be used if browser_pool is provided.

browser_new_context_options

browser_new_context_options: NotRequired[Mapping[str, Any]]

Keyword arguments to pass to the browser new context method. These options are provided directly to Playwright's browser.new_context method. For more details, refer to the Playwright documentation: https://playwright.dev/python/docs/api/class-browser#browser-new-context. This option should not be used if browser_pool is provided.

browser_pool

browser_pool: NotRequired[BrowserPool]

A BrowserPool instance to be used for launching the browsers and getting pages.

browser_type

browser_type: NotRequired[BrowserType]

The type of browser to launch:

'chromium', 'firefox', 'webkit': Use Playwright-managed browsers
'chrome': Use your locally installed Google Chrome browser. Requires Google Chrome to be installed on the system. This option should not be used if browser_pool is provided.

concurrency_settings

concurrency_settings: NotRequired[ConcurrencySettings]

Settings to fine-tune concurrency levels.

configuration

configuration: NotRequired[Configuration]

The Configuration instance. Some of its properties are used as defaults for the crawler.

configure_logging

configure_logging: NotRequired[bool]

If True, the crawler will set up logging infrastructure automatically.

event_manager

event_manager: NotRequired[EventManager]

The event manager for managing events for the crawler and all its components.

headless

headless: NotRequired[bool]

Whether to run the browser in headless mode. This option should not be used if browser_pool is provided.

http_client

http_client: NotRequired[HttpClient]

HTTP client used by BasicCrawlingContext.send_request method.

ignore_http_error_status_codes

ignore_http_error_status_codes: NotRequired[Iterable[int]]

HTTP status codes that are typically considered errors but should be treated as successful responses.

keep_alive

keep_alive: NotRequired[bool]

Flag that can keep crawler running even when there are no requests in queue.

max_crawl_depth

max_crawl_depth: NotRequired[int | None]

Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

max_request_retries

max_request_retries: NotRequired[int]

Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks etc.).

This limit does not apply to retries triggered by session rotation (see max_session_rotations).

max_requests_per_crawl

max_requests_per_crawl: NotRequired[int | None]

Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

max_session_rotations

max_session_rotations: NotRequired[int]

Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

The session rotations are not counted towards the max_request_retries limit.

proxy_configuration

proxy_configuration: NotRequired[ProxyConfiguration]

HTTP proxy configuration used when making requests.

request_handler

request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]]

A callable responsible for handling requests.

request_handler_timeout

request_handler_timeout: NotRequired[timedelta]

Maximum duration allowed for a single request handler to run.

request_manager

request_manager: NotRequired[RequestManager]

Manager of requests that should be processed by the crawler.

respect_robots_txt_file

respect_robots_txt_file: NotRequired[bool]

If set to True, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via EnqueueLinksFunction.

retry_on_blocked

retry_on_blocked: NotRequired[bool]

If True, the crawler attempts to bypass bot protections automatically.

session_pool

session_pool: NotRequired[SessionPool]

A custom SessionPool instance, allowing the use of non-default configuration.

statistics

statistics: NotRequired[Statistics[TStatisticsState]]

A custom Statistics instance, allowing the use of non-default configuration.

statistics_log_format

statistics_log_format: NotRequired[Literal['table', 'inline']]

If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.

status_message_callback

status_message_callback: NotRequired[ Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] ]

Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.

status_message_logging_interval

status_message_logging_interval: NotRequired[timedelta]

Interval for logging the crawler status messages.

storage_client

storage_client: NotRequired[StorageClient]

The storage client for managing storages for the crawler and all its components.

use_session_pool

use_session_pool: NotRequired[bool]

Enable the use of a session pool for managing sessions during crawling.

Hierarchy

Index

Properties

Properties

abort_on_error

additional_http_error_status_codes

browser_launch_options

browser_new_context_options

browser_pool

browser_type

concurrency_settings

configuration

configure_logging

event_manager

headless

http_client

ignore_http_error_status_codes

keep_alive

max_crawl_depth

max_request_retries

max_requests_per_crawl

max_session_rotations

proxy_configuration

request_handler

request_handler_timeout

request_manager

respect_robots_txt_file

retry_on_blocked

session_pool

statistics

statistics_log_format

status_message_callback

status_message_logging_interval

storage_client

use_session_pool