
BasicCrawlerOptions

Arguments for the BasicCrawler constructor.

It is intended for typing forwarded __init__ arguments in subclasses.

Properties

concurrency_settings

concurrency_settings: NotRequired[ConcurrencySettings]

Settings to fine-tune concurrency levels.
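
For illustration, a minimal sketch of passing custom concurrency settings to the constructor; the ConcurrencySettings fields shown and the import paths are assumptions that may differ between crawlee versions:

from crawlee import ConcurrencySettings
from crawlee.basic_crawler import BasicCrawler

# Keep between 1 and 10 tasks running in parallel, and start
# no more than 100 tasks per minute.
crawler = BasicCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=1,
        max_concurrency=10,
        max_tasks_per_minute=100,
    ),
)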

configuration

configuration: NotRequired[Configuration]

Crawler configuration.

configure_logging

configure_logging: NotRequired[bool]

If True, the crawler will set up logging infrastructure automatically.

event_manager

event_manager: NotRequired[EventManager]

A custom EventManager instance, allowing the use of non-default configuration.

http_client

http_client: NotRequired[BaseHttpClient]

HTTP client used by BasicCrawlingContext.send_request and for HTTP-based crawling.
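
For illustration, a sketch of supplying a custom HTTP client; the HttpxHttpClient class name and import path are assumptions that may differ between crawlee versions:

from crawlee.basic_crawler import BasicCrawler
from crawlee.http_clients import HttpxHttpClient

# Both send_request and plain HTTP crawling will go through this client.
crawler = BasicCrawler(http_client=HttpxHttpClient())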

max_crawl_depth

max_crawl_depth: NotRequired[int | None]

Limits crawl depth from 0 (initial requests) up to the specified max_crawl_depth. Requests at the maximum depth are processed, but no further links are enqueued.

max_request_retries

max_request_retries: NotRequired[int]

Maximum number of attempts to process a single request.

max_requests_per_crawl

max_requests_per_crawl: NotRequired[int | None]

Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
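
For illustration, a sketch combining the three limit options above; the values are arbitrary and the import path is an assumption:

from crawlee.basic_crawler import BasicCrawler

crawler = BasicCrawler(
    # Follow links at most two hops away from the initial requests.
    max_crawl_depth=2,
    # Attempt to process each request at most three times.
    max_request_retries=3,
    # Stop after roughly 1000 pages; concurrency may let a few extra through.
    max_requests_per_crawl=1000,
)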

max_session_rotations

max_session_rotations: NotRequired[int]

Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

proxy_configuration

proxy_configuration: NotRequired[ProxyConfiguration]

HTTP proxy configuration used when making requests.
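
For illustration, a sketch of a proxy setup; the ProxyConfiguration import path and the proxy_urls parameter are assumptions, and the URLs are placeholders:

from crawlee.basic_crawler import BasicCrawler
from crawlee.proxy_configuration import ProxyConfiguration

crawler = BasicCrawler(
    # Requests will be routed through these proxies.
    proxy_configuration=ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.example.com:8000',
            'http://proxy-2.example.com:8000',
        ],
    ),
)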

request_handler

request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]]

A callable responsible for handling requests.

request_handler_timeout

request_handler_timeout: NotRequired[timedelta]

Maximum duration allowed for a single request handler to run.
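
For illustration, a runnable sketch wiring a request handler together with a custom handler timeout; the import paths are assumptions that may differ between crawlee versions:

import asyncio
from datetime import timedelta

from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext

async def handler(context: BasicCrawlingContext) -> None:
    # Called once per request; must finish within request_handler_timeout.
    context.log.info(f'Processing {context.request.url}')

async def main() -> None:
    crawler = BasicCrawler(
        request_handler=handler,
        request_handler_timeout=timedelta(seconds=30),
    )
    await crawler.run(['https://example.com'])

asyncio.run(main())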

request_provider

request_provider: NotRequired[RequestProvider]

Provider for requests to be processed by the crawler.
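
For illustration, a sketch of supplying requests from an explicitly opened request queue, assuming RequestQueue (a RequestProvider implementation) is importable from crawlee.storages:

import asyncio

from crawlee.basic_crawler import BasicCrawler
from crawlee.storages import RequestQueue

async def main() -> None:
    # Open a named queue and pre-fill it before the crawl starts.
    queue = await RequestQueue.open(name='my-queue')
    await queue.add_request('https://example.com')
    crawler = BasicCrawler(request_provider=queue)

asyncio.run(main())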

retry_on_blocked

retry_on_blocked: NotRequired[bool]

If True, the crawler attempts to bypass bot protections automatically.

session_pool

session_pool: NotRequired[SessionPool]

A custom SessionPool instance, allowing the use of non-default configuration.

statistics

statistics: NotRequired[Statistics[StatisticsState]]

A custom Statistics instance, allowing the use of non-default configuration.

use_session_pool

use_session_pool: NotRequired[bool]

If True, the crawler uses a session pool to manage sessions during crawling.
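
For illustration, a sketch enabling session management with a custom pool; the SessionPool import path and the max_pool_size parameter are assumptions:

from crawlee.basic_crawler import BasicCrawler
from crawlee.sessions import SessionPool

crawler = BasicCrawler(
    # Manage sessions explicitly, with at most 50 of them alive at once.
    use_session_pool=True,
    session_pool=SessionPool(max_pool_size=50),
)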