ParselCrawler
Hierarchy
- AbstractHttpCrawler
- ParselCrawler
Index
Methods
__init__
Initialize a new instance.
Parameters
keyword-only, optional request_handler: Callable[[TCrawlingContext], Awaitable[None]]
A callable responsible for handling requests.
keyword-only, optional statistics: Statistics[TStatisticsState]
A custom Statistics instance, allowing the use of non-default configuration.
keyword-only, optional configuration: Configuration
The Configuration instance. Some of its properties are used as defaults for the crawler.
keyword-only, optional event_manager: EventManager
The event manager for managing events for the crawler and all its components.
keyword-only, optional storage_client: StorageClient
The storage client for managing storages for the crawler and all its components.
keyword-only, optional request_manager: RequestManager
Manager of requests that should be processed by the crawler.
keyword-only, optional session_pool: SessionPool
A custom SessionPool instance, allowing the use of non-default configuration.
keyword-only, optional proxy_configuration: ProxyConfiguration
HTTP proxy configuration used when making requests.
keyword-only, optional http_client: HttpClient
HTTP client used by the BasicCrawlingContext.send_request method.
keyword-only, optional max_request_retries: int
Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks, etc.). This limit does not apply to retries triggered by session rotation (see max_session_rotations).
keyword-only, optional max_requests_per_crawl: int | None
Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.
keyword-only, optional max_session_rotations: int
Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request. Session rotations are not counted towards the max_request_retries limit.
keyword-only, optional max_crawl_depth: int | None
Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.
keyword-only, optional use_session_pool: bool
Enable the use of a session pool for managing sessions during crawling.
keyword-only, optional retry_on_blocked: bool
If True, the crawler attempts to bypass bot protections automatically.
keyword-only, optional concurrency_settings: ConcurrencySettings
Settings to fine-tune concurrency levels.
keyword-only, optional request_handler_timeout: timedelta
Maximum duration allowed for a single request handler to run.
keyword-only, optional abort_on_error: bool
If True, the crawler stops immediately when any request handler error occurs.
keyword-only, optional configure_logging: bool
If True, the crawler will set up the logging infrastructure automatically.
keyword-only, optional statistics_log_format: Literal['table', 'inline']
If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.
keyword-only, optional keep_alive: bool
If True, keeps the crawler running even when there are no requests in the queue.
keyword-only, optional additional_http_error_status_codes: Iterable[int]
Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.
keyword-only, optional ignore_http_error_status_codes: Iterable[int]
HTTP status codes that are typically considered errors but should be treated as successful responses.
keyword-only, optional respect_robots_txt_file: bool
If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip URLs that are not allowed. This also prevents disallowed URLs from being added via EnqueueLinksFunction.
keyword-only, optional status_message_logging_interval: timedelta
Interval for logging the crawler status messages.
keyword-only, optional status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]
Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.
Returns None
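A minimal sketch of constructing the crawler with a few of the keyword-only options listed above; the import path assumes a recent Crawlee for Python release, so adjust it to your installed version:

```python
from datetime import timedelta

from crawlee.crawlers import ParselCrawler

# Construct the crawler with a handful of the keyword-only options described above.
crawler = ParselCrawler(
    max_requests_per_crawl=100,                     # stop after roughly 100 pages
    max_request_retries=2,                          # retry a failed request at most twice
    request_handler_timeout=timedelta(seconds=30),  # per-request handler time budget
    respect_robots_txt_file=True,                   # skip URLs disallowed by robots.txt
)
```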
add_requests
Add requests to the underlying request manager in batches.
Parameters
requests: Sequence[str | Request]
A list of requests to add to the queue.
optional, keyword-only forefront: bool = False
If True, add requests to the forefront of the queue.
optional, keyword-only batch_size: int = 1000
The number of requests to add in one batch.
optional, keyword-only wait_time_between_batches: timedelta = timedelta(0)
Time to wait between adding batches.
optional, keyword-only wait_for_all_requests_to_be_added: bool = False
If True, wait for all requests to be added before returning.
optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Timeout for waiting for all requests to be added.
Returns None
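For illustration, a hedged sketch of enqueueing a mix of plain URLs and Request objects; Request.from_url and its label keyword are assumed from the broader Crawlee API:

```python
from crawlee import Request

# Add a plain URL and a pre-built Request object; forefront=True puts them at the
# front of the queue so they are processed before already-enqueued requests.
await crawler.add_requests(
    [
        'https://crawlee.dev',
        Request.from_url('https://crawlee.dev/docs', label='DOCS'),
    ],
    forefront=True,
)
```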
create_parsed_http_crawler_class
Create a specific version of the AbstractHttpCrawler class.
This is a convenience factory method for creating a specific AbstractHttpCrawler subclass. While AbstractHttpCrawler allows its two generic parameters to be independent, this method simplifies cases where TParseResult is used for both generic parameters.
Parameters
static_parser: AbstractHttpParser[TParseResult, TSelectResult]
Returns type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult, TSelectResult]]
error_handler
Register a function to handle errors occurring in request handlers.
The error handler is invoked after a request handler error occurs and before a retry attempt.
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
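A sketch of registering an error handler as a decorator; the two-argument (context, error) callback shape is assumed from the broader Crawlee API, and the context annotation is omitted to stay version-agnostic:

```python
# Invoked after a request handler error occurs and before the retry attempt.
@crawler.error_handler
async def handle_error(context, error: Exception) -> None:
    context.log.warning(f'Retrying {context.request.url} after error: {error}')
```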
export_data
Export all items from a Dataset to a JSON or CSV file.
This method simplifies the process of exporting data collected during crawling. It automatically determines the export format based on the file extension (.json or .csv) and handles the conversion of Dataset items to the appropriate format.
Parameters
path: str | Path
The destination file path. Must end with '.json' or '.csv'.
optional dataset_id: str | None = None
The ID of the Dataset to export from. If None, the dataset_name parameter is used instead.
optional dataset_name: str | None = None
The name of the Dataset to export from. If None, the dataset_id parameter is used instead.
Returns None
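A short sketch of exporting the collected items once the crawl has finished; the format is inferred from the file extension, and the dataset name used here is only an example:

```python
# Export everything from the default dataset to JSON.
await crawler.export_data('results.json')

# Or export a named dataset to CSV.
await crawler.export_data('products.csv', dataset_name='products')
```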
failed_request_handler
Register a function to handle requests that exceed the maximum retry limit.
The failed request handler is invoked when a request has failed all retry attempts.
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
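Similarly, a sketch of a failed-request handler, invoked only after all retry attempts are exhausted; the (context, error) shape is assumed as above:

```python
# Called once a request has failed all of its retry attempts.
@crawler.failed_request_handler
async def handle_failed(context, error: Exception) -> None:
    context.log.error(f'Giving up on {context.request.url}: {error}')
```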
get_data
Retrieve data from a Dataset.
This helper method simplifies the process of retrieving data from a Dataset. It opens the specified one and then retrieves the data based on the provided parameters.
Parameters
optional dataset_id: str | None = None
The ID of the Dataset.
optional dataset_name: str | None = None
The name of the Dataset.
keyword-only, optional offset: int
Skips the specified number of items at the start.
keyword-only, optional limit: int | None
The maximum number of items to retrieve. Unlimited if None.
keyword-only, optional clean: bool
Return only non-empty items and exclude hidden fields. Shortcut for skip_hidden and skip_empty.
keyword-only, optional desc: bool
Set to True to sort results in descending order.
keyword-only, optional fields: list[str]
Fields to include in each item. Sorts fields as specified if provided.
keyword-only, optional omit: list[str]
Fields to exclude from each item.
keyword-only, optional unwind: str
Unwinds items by a specified array field, turning each element into a separate item.
keyword-only, optional skip_empty: bool
Excludes empty items from the results if True.
keyword-only, optional skip_hidden: bool
Excludes fields starting with '#' if True.
keyword-only, optional flatten: list[str]
Fields to be flattened in returned items.
keyword-only, optional view: str
Specifies the dataset view to be used.
Returns DatasetItemsListPage
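A sketch of pulling a slice of the collected items directly through the crawler; the returned page object is assumed to expose the items alongside paging metadata:

```python
# Fetch at most 50 items from the default dataset, in descending order.
page = await crawler.get_data(limit=50, desc=True)
for item in page.items:
    print(item)
```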
get_dataset
Return the Dataset with the given ID or name. If none is provided, return the default one.
Parameters
optional, keyword-only id: str | None = None
optional, keyword-only name: str | None = None
Returns Dataset
get_key_value_store
Return the KeyValueStore with the given ID or name. If none is provided, return the default KVS.
Parameters
optional, keyword-only id: str | None = None
optional, keyword-only name: str | None = None
Returns KeyValueStore
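A brief sketch combining both storage getters; the names used here ('products', 'run-info') are arbitrary examples:

```python
# Open a named dataset and the default key-value store.
dataset = await crawler.get_dataset(name='products')
kvs = await crawler.get_key_value_store()

# Store a record in each.
await dataset.push_data({'sku': 'A-1', 'price': 9.99})
await kvs.set_value('run-info', {'started': True})
```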
get_request_manager
Return the configured request manager. If none is configured, open and return the default request queue.
Returns RequestManager
on_skipped_request
Register a function to handle skipped requests.
The skipped request handler is invoked when a request is skipped due to a collision or other reasons.
Parameters
callback: SkippedRequestCallback
Returns SkippedRequestCallback
pre_navigation_hook
Register a hook to be called before each navigation.
Parameters
hook: Callable[[BasicCrawlingContext], Awaitable[None]]
A coroutine function to be called before each navigation.
Returns None
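A sketch of registering a hook that runs before each navigation; the context argument is a BasicCrawlingContext, so the annotation is left off to avoid pinning an import path:

```python
# The hook receives the crawling context before the request is made.
async def log_navigation(context) -> None:
    context.log.info(f'About to fetch {context.request.url}')

crawler.pre_navigation_hook(log_navigation)
```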
run
Run the crawler until all requests are processed.
Parameters
optional requests: Sequence[str | Request] | None = None
The requests to be enqueued before the crawler starts.
optional, keyword-only purge_request_queue: bool = True
If this is True and the crawler is not being run for the first time, the default request queue will be purged.
Returns FinalStatistics
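A sketch of starting a crawl with seed URLs and inspecting the returned FinalStatistics; purge_request_queue=False keeps previously enqueued requests on a repeated run:

```python
# Run the crawler with two seed URLs and keep any previously enqueued requests.
stats = await crawler.run(
    ['https://crawlee.dev', 'https://crawlee.dev/docs'],
    purge_request_queue=False,
)
crawler.log.info(f'Finished crawl: {stats}')
```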
stop
Set a flag to stop the crawler.
This stops the current crawler run regardless of whether all requests were finished.
Parameters
optional reason: str = 'Stop was called externally.'
Reason for stopping that will be used in logs.
Returns None
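A sketch of stopping the crawler early from inside a request handler once some condition is met; the early-exit condition here is only illustrative:

```python
@crawler.router.default_handler
async def request_handler(context) -> None:
    await context.push_data({'url': context.request.url})
    # Illustrative condition: stop once a specific page has been reached.
    if context.request.url.endswith('/last-page'):
        crawler.stop(reason='Reached the last page of interest.')
```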
Properties
log
The logger used by the crawler.
router
The Router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
A web crawler for performing HTTP requests and parsing HTML/XML content.
The ParselCrawler builds on top of the AbstractHttpCrawler, which means it inherits all of its features. It specifies its own parser, ParselParser, which is used to parse HttpResponse. ParselParser uses the following library for parsing: https://pypi.org/project/parsel/
The HTTP client-based crawlers are ideal for websites that do not require JavaScript execution. However, if you need to execute client-side JavaScript, consider using a browser-based crawler like the PlaywrightCrawler.
Usage
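A minimal end-to-end sketch, assuming ParselCrawler and ParselCrawlingContext are importable from crawlee.crawlers as in recent releases:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # Limit the crawl so the example finishes quickly.
    crawler = ParselCrawler(max_requests_per_crawl=10)

    # The default handler is called for every request; context.selector is a
    # parsel Selector built from the HTTP response body.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        })
        # Enqueue links found on the page for further crawling.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```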