BasicCrawlerOptions

Arguments for the BasicCrawler constructor.

It is intended for typing forwarded __init__ arguments in the subclasses.

Hierarchy

_BasicCrawlerOptionsGeneric
_BasicCrawlerOptions
- BasicCrawlerOptions
  - PlaywrightCrawlerOptions

Properties

keyword-onlyoptionalabort_on_error

abort_on_error: NotRequired[bool]

If True, the crawler stops immediately when any request handler error occurs.

keyword-onlyoptionaladditional_http_error_status_codes

additional_http_error_status_codes: NotRequired[Iterable[int]]

Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

keyword-onlyoptionalconcurrency_settings

concurrency_settings: NotRequired[ConcurrencySettings]

Settings to fine-tune concurrency levels.

keyword-onlyoptionalconfiguration

configuration: NotRequired[Configuration]

The Configuration instance. Some of its properties are used as defaults for the crawler.

keyword-onlyoptionalconfigure_logging

configure_logging: NotRequired[bool]

If True, the crawler will set up logging infrastructure automatically.

keyword-onlyoptionalevent_manager

event_manager: NotRequired[EventManager]

The event manager for managing events for the crawler and all its components.

keyword-onlyoptionalhttp_client

http_client: NotRequired[HttpClient]

HTTP client used by BasicCrawlingContext.send_request method.

keyword-onlyoptionalignore_http_error_status_codes

ignore_http_error_status_codes: NotRequired[Iterable[int]]

HTTP status codes that are typically considered errors but should be treated as successful responses.

keyword-onlyoptionalkeep_alive

keep_alive: NotRequired[bool]

Flag that can keep crawler running even when there are no requests in queue.

keyword-onlyoptionalmax_crawl_depth

max_crawl_depth: NotRequired[int | None]

Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

keyword-onlyoptionalmax_request_retries

max_request_retries: NotRequired[int]

Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks etc.).

This limit does not apply to retries triggered by session rotation (see max_session_rotations).

keyword-onlyoptionalmax_requests_per_crawl

max_requests_per_crawl: NotRequired[int | None]

Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

keyword-onlyoptionalmax_session_rotations

max_session_rotations: NotRequired[int]

Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

The session rotations are not counted towards the max_request_retries limit.

keyword-onlyoptionalproxy_configuration

proxy_configuration: NotRequired[ProxyConfiguration]

HTTP proxy configuration used when making requests.

keyword-onlyoptionalrequest_handler

request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]]

A callable responsible for handling requests.

keyword-onlyoptionalrequest_handler_timeout

request_handler_timeout: NotRequired[timedelta]

Maximum duration allowed for a single request handler to run.

keyword-onlyoptionalrequest_manager

request_manager: NotRequired[RequestManager]

Manager of requests that should be processed by the crawler.

keyword-onlyoptionalrespect_robots_txt_file

respect_robots_txt_file: NotRequired[bool]

If set to True, the crawler will automatically try to fetch the robots.txt file for each domain, and skip those that are not allowed. This also prevents disallowed URLs to be added via EnqueueLinksFunction.

keyword-onlyoptionalretry_on_blocked

retry_on_blocked: NotRequired[bool]

If True, the crawler attempts to bypass bot protections automatically.

keyword-onlyoptionalsession_pool

session_pool: NotRequired[SessionPool]

A custom SessionPool instance, allowing the use of non-default configuration.

keyword-onlyoptionalstatistics

statistics: NotRequired[Statistics[TStatisticsState]]

A custom Statistics instance, allowing the use of non-default configuration.

keyword-onlyoptionalstatistics_log_format

statistics_log_format: NotRequired[Literal['table', 'inline']]

If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.

keyword-onlyoptionalstatus_message_callback

status_message_callback: NotRequired[ Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] ]

Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.

keyword-onlyoptionalstatus_message_logging_interval

status_message_logging_interval: NotRequired[timedelta]

Interval for logging the crawler status messages.

keyword-onlyoptionalstorage_client

storage_client: NotRequired[StorageClient]

The storage client for managing storages for the crawler and all its components.

keyword-onlyoptionaluse_session_pool

use_session_pool: NotRequired[bool]

Enable the use of a session pool for managing sessions during crawling.

Hierarchy

Index

Properties

Properties

keyword-onlyoptionalabort_on_error

keyword-onlyoptionaladditional_http_error_status_codes

keyword-onlyoptionalconcurrency_settings

keyword-onlyoptionalconfiguration

keyword-onlyoptionalconfigure_logging

keyword-onlyoptionalevent_manager

keyword-onlyoptionalhttp_client

keyword-onlyoptionalignore_http_error_status_codes

keyword-onlyoptionalkeep_alive

keyword-onlyoptionalmax_crawl_depth

keyword-onlyoptionalmax_request_retries

keyword-onlyoptionalmax_requests_per_crawl

keyword-onlyoptionalmax_session_rotations

keyword-onlyoptionalproxy_configuration

keyword-onlyoptionalrequest_handler

keyword-onlyoptionalrequest_handler_timeout

keyword-onlyoptionalrequest_manager

keyword-onlyoptionalrespect_robots_txt_file

keyword-onlyoptionalretry_on_blocked

keyword-onlyoptionalsession_pool

keyword-onlyoptionalstatistics

keyword-onlyoptionalstatistics_log_format

keyword-onlyoptionalstatus_message_callback

keyword-onlyoptionalstatus_message_logging_interval

keyword-onlyoptionalstorage_client

keyword-onlyoptionaluse_session_pool