HttpCrawlerOptions
Hierarchy
- BasicCrawlerOptions
- HttpCrawlerOptions
Index
Properties
- abort_on_error
- additional_http_error_status_codes
- concurrency_settings
- configuration
- configure_logging
- event_manager
- http_client
- ignore_http_error_status_codes
- max_crawl_depth
- max_request_retries
- max_requests_per_crawl
- max_session_rotations
- proxy_configuration
- request_handler
- request_handler_timeout
- request_manager
- retry_on_blocked
- session_pool
- statistics
- storage_client
- use_session_pool
Properties
abort_on_error
If True, the crawler stops immediately when any request handler error occurs.
additional_http_error_status_codes
Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
concurrency_settings
Settings to fine-tune concurrency levels.
configuration
The configuration object. Some of its properties are used as defaults for the crawler.
configure_logging
If True, the crawler will set up logging infrastructure automatically.
event_manager
The event manager for managing events for the crawler and all its components.
http_client
HTTP client used by the BasicCrawlingContext.send_request method.
ignore_http_error_status_codes
HTTP status codes typically considered errors but to be treated as successful responses.
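The interplay between additional_http_error_status_codes and ignore_http_error_status_codes can be pictured with a small, library-free sketch. The function name `is_error_status`, the baseline rule (5xx counts as an error), and the choice to let the "additional" set win when a code appears in both sets are all assumptions made for illustration; they are not taken from the crawler's implementation.

```python
def is_error_status(status, additional=frozenset(), ignore=frozenset()):
    """Decide whether a response status should be treated as an error.

    Assumed baseline for this sketch: only 5xx codes are errors by default.
    """
    if status in additional:
        return True   # explicitly promoted to an error (would trigger a retry)
    if status in ignore:
        return False  # explicitly treated as a successful response
    return status >= 500  # assumed default rule, not the library's


# A 403 is promoted to an error, while a 503 is accepted as a success.
promoted = is_error_status(403, additional={403})
accepted = not is_error_status(503, ignore={503})
```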
max_crawl_depth
Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
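The depth rule can be illustrated with a toy breadth-first crawl over an in-memory link graph. This is a minimal sketch, not the crawler's code: the `crawl` function and the `LINKS` dictionary are hypothetical, and only the depth semantics (start at 0, process requests at the maximum depth, enqueue nothing from them) follow the description above.

```python
from collections import deque


def crawl(start, links, max_crawl_depth=None):
    """Return the pages processed, in order, under the depth rule above."""
    processed = []
    seen = {start}
    queue = deque([(start, 0)])  # initial requests start at depth 0
    while queue:
        url, depth = queue.popleft()
        processed.append(url)  # a request at the maximum depth is still processed
        if max_crawl_depth is not None and depth >= max_crawl_depth:
            continue  # ...but no new links are enqueued from it
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return processed


# Hypothetical site: a links to b and c, b links to d, d links to e.
LINKS = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': ['e']}

shallow = crawl('a', LINKS, max_crawl_depth=1)  # ['a', 'b', 'c']
unlimited = crawl('a', LINKS)                   # ['a', 'b', 'c', 'd', 'e']
```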
max_request_retries
Maximum number of attempts to process a single request.
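Read literally, the description caps the total number of attempts per request. The sketch below shows that reading with plain Python. `process_request` and `make_flaky_handler` are hypothetical helpers, and the real crawler's retry bookkeeping may differ.

```python
def process_request(handler, request, max_attempts):
    """Call handler(request) up to max_attempts times.

    Returns (result, attempts_used) or raises once the attempts are exhausted.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(request), attempt
        except Exception as exc:  # a real crawler would be more selective
            last_error = exc
    raise RuntimeError(f'gave up after {max_attempts} attempts') from last_error


def make_flaky_handler(failures):
    """Hypothetical handler that fails `failures` times, then succeeds."""
    state = {'calls': 0}

    def handler(request):
        state['calls'] += 1
        if state['calls'] <= failures:
            raise ConnectionError('simulated failure')
        return f'handled {request}'

    return handler


# Two failures followed by a success fit within three attempts.
result, attempts = process_request(make_flaky_handler(2), 'r', max_attempts=3)
```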
max_requests_per_crawl
Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit.
Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit.
Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
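Why the limit can be overshot is easiest to see in a toy model. In the sketch below, several asyncio workers check the limit only before starting a page, so pages already in flight when the limit is reached still complete. The `run_toy_crawl` function and its parameters are assumptions made for illustration and do not reflect how the crawler schedules work internally.

```python
import asyncio


async def run_toy_crawl(limit: int, workers: int) -> int:
    """Return how many simulated pages were handled before all workers stopped."""
    handled = 0

    async def worker() -> None:
        nonlocal handled
        # The limit is checked only before a page is started.
        while handled < limit:
            await asyncio.sleep(0)  # stands in for fetching and handling a page
            handled += 1

    await asyncio.gather(*(worker() for _ in range(workers)))
    return handled


# With several workers the total may exceed the limit, but by at most workers - 1.
total = asyncio.run(run_toy_crawl(limit=5, workers=3))
```

In this model a single worker stops exactly at the limit, and the overshoot grows with the number of concurrent workers.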
max_session_rotations
Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.
proxy_configuration
HTTP proxy configuration used when making requests.
request_handler
A callable responsible for handling requests.
request_handler_timeout
Maximum duration allowed for a single request handler to run.
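A handler timeout of this kind can be sketched with asyncio.wait_for from the standard library. The helper names below are hypothetical, and expressing the duration as a datetime.timedelta is an assumption of the sketch, since the description above does not state the option's type.

```python
import asyncio
from datetime import timedelta


async def run_with_timeout(handler, timeout: timedelta):
    """Await handler() but give up once the timeout elapses."""
    return await asyncio.wait_for(handler(), timeout=timeout.total_seconds())


async def slow_handler():
    await asyncio.sleep(1)  # pretends to be a handler that takes too long
    return 'done'


async def fast_handler():
    return 'done'


def outcome(handler, timeout: timedelta) -> str:
    """Run one handler to completion and report how it ended."""
    try:
        return asyncio.run(run_with_timeout(handler, timeout))
    except asyncio.TimeoutError:
        return 'timed out'
```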
request_manager
Manager of requests that should be processed by the crawler.
retry_on_blocked
If True, the crawler attempts to bypass bot protections automatically.
session_pool
A custom SessionPool instance, allowing the use of non-default configuration.
statistics
A custom Statistics instance, allowing the use of non-default configuration.
storage_client
The storage client for managing storages for the crawler and all its components.
use_session_pool
Enable the use of a session pool for managing sessions during crawling.
Arguments for the AbstractHttpCrawler constructor. It is intended for typing forwarded __init__ arguments in the subclasses.
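The general pattern of typing forwarded __init__ arguments can be sketched with a TypedDict and Unpack (PEP 692). Every class below is a hypothetical stand-in and not part of the library. Only two option names are borrowed from the property list above, and the default of 3 is an arbitrary placeholder.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, TypedDict

if TYPE_CHECKING:
    from typing import Unpack  # Python 3.11+; only needed by type checkers


class ToyCrawlerOptions(TypedDict, total=False):
    """Hypothetical subset of options; every key is optional."""

    max_request_retries: int
    max_crawl_depth: int | None


class ToyBaseCrawler:
    def __init__(self, **kwargs: Unpack[ToyCrawlerOptions]) -> None:
        # Placeholder default chosen for this sketch only.
        self.max_request_retries = kwargs.get('max_request_retries', 3)
        self.max_crawl_depth = kwargs.get('max_crawl_depth')


class ToySubCrawler(ToyBaseCrawler):
    def __init__(self, *, label: str, **kwargs: Unpack[ToyCrawlerOptions]) -> None:
        # The subclass adds its own argument and forwards the rest unchanged,
        # while a type checker still validates the forwarded keyword names.
        super().__init__(**kwargs)
        self.label = label


crawler = ToySubCrawler(label='demo', max_crawl_depth=2)
```

The benefit is that a subclass does not have to repeat every base-class parameter in its own signature, yet a misspelled or mistyped option is still reported by the type checker.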