BasicCrawler
crawlee.basic_crawler._basic_crawler.BasicCrawler
Index
Errors
error_handler
Decorator for configuring an error handler (called after a request handler error and before retrying).
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
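A minimal sketch of registering an error handler with this decorator. The (context, error) handler signature and the use of context.request.url are assumptions inferred from the ErrorHandler type, not spelled out on this page.

```python
from crawlee.basic_crawler import BasicCrawler

crawler = BasicCrawler()

@crawler.error_handler
async def handle_error(context, error) -> None:
    # Runs after the request handler raised, before the request is retried.
    crawler.log.warning('Retrying %s after error: %s', context.request.url, error)
```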
Constructors
__init__
Initialize the BasicCrawler.
Parameters
request_provider: RequestProvider | None = None (keyword-only)
request_handler: Callable[[TCrawlingContext], Awaitable[None]] | None = None (keyword-only)
http_client: BaseHttpClient | None = None (keyword-only)
concurrency_settings: ConcurrencySettings | None = None (keyword-only)
max_request_retries: int = 3 (keyword-only)
max_requests_per_crawl: int | None = None (keyword-only)
max_session_rotations: int = 10 (keyword-only)
configuration: Configuration | None = None (keyword-only)
request_handler_timeout: timedelta = timedelta(minutes=1) (keyword-only)
session_pool: SessionPool | None = None (keyword-only)
use_session_pool: bool = True (keyword-only)
retry_on_blocked: bool = True (keyword-only)
proxy_configuration: ProxyConfiguration | None = None (keyword-only)
statistics: Statistics | None = None (keyword-only)
event_manager: EventManager | None = None (keyword-only)
configure_logging: bool = True (keyword-only)
_context_pipeline: ContextPipeline[TCrawlingContext] | None = None (keyword-only)
_additional_context_managers: Sequence[AsyncContextManager] | None = None (keyword-only)
_logger: logging.Logger | None = None (keyword-only)
Returns None
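A sketch of constructing a crawler with a few of the keyword-only options listed above; the chosen values are arbitrary examples.

```python
from datetime import timedelta

from crawlee.basic_crawler import BasicCrawler

crawler = BasicCrawler(
    max_request_retries=5,                          # retry a failing request up to 5 times
    max_requests_per_crawl=100,                     # stop the crawl after 100 requests
    request_handler_timeout=timedelta(seconds=30),  # per-request handler timeout
    retry_on_blocked=True,                          # keep the default blocking-detection behaviour
)
```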
Methods
add_requests
Add requests to the underlying request provider in batches.
Parameters
requests: Sequence[str | Request]
batch_size: int = 1000 (keyword-only)
wait_time_between_batches: timedelta = timedelta(0) (keyword-only)
wait_for_all_requests_to_be_added: bool = False (keyword-only)
wait_for_all_requests_to_be_added_timeout: timedelta | None = None (keyword-only)
Returns None
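A sketch of enqueuing a batch of start URLs, assuming a crawler instance as constructed above and an enclosing async function; the URLs and option values are illustrative.

```python
from datetime import timedelta

await crawler.add_requests(
    ['https://example.com/a', 'https://example.com/b'],
    batch_size=500,                                  # send at most 500 requests per batch
    wait_time_between_batches=timedelta(seconds=1),  # pause between batches
    wait_for_all_requests_to_be_added=True,          # return only after every batch is enqueued
)
```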
export_data
Export data from a dataset.
This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.
Parameters
path: str | Path
content_type: Literal['json', 'csv'] | None = None
dataset_id: str | None = None
dataset_name: str | None = None
Returns None
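A sketch of exporting crawl results from inside an async context; the file paths and the dataset name are hypothetical examples.

```python
# Export the default dataset to a JSON file.
await crawler.export_data('results.json', content_type='json')

# Export a named dataset to a CSV file.
await crawler.export_data('products.csv', content_type='csv', dataset_name='products')
```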
failed_request_handler
Decorator for configuring a failed request handler (called after max retries are reached).
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
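A sketch of registering a failed request handler; as with error_handler, the (context, error) signature is an assumption inferred from the handler type.

```python
@crawler.failed_request_handler
async def handle_failed(context, error) -> None:
    # Runs once max_request_retries has been exhausted for a request.
    crawler.log.error('Giving up on %s: %s', context.request.url, error)
```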
get_data
Retrieve data from a dataset.
This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.
Parameters
dataset_id: str | None = None
dataset_name: str | None = None
kwargs: Unpack[GetDataKwargs]
Returns DatasetItemsListPage
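A sketch of reading items back from the default dataset; the limit option is assumed to be one of the settings accepted via GetDataKwargs, and items is assumed to be the attribute holding the records on the returned page.

```python
page = await crawler.get_data(limit=50)
for item in page.items:
    print(item)
```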
get_dataset
Return the dataset with the given ID or name. If none is provided, return the default dataset.
Parameters
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
Returns Dataset
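A sketch of opening a named dataset and writing to it; the dataset name and the push_data call on the returned Dataset are assumptions, not documented on this page.

```python
dataset = await crawler.get_dataset(name='products')
await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})
```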
get_key_value_store
Return the key-value store with the given ID or name. If none is provided, return the default KVS.
Parameters
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
Returns KeyValueStore
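A sketch of persisting a value in the default key-value store; the set_value call on the returned KeyValueStore is an assumption.

```python
kvs = await crawler.get_key_value_store()
await kvs.set_value('crawl-state', {'processed': 0})
```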
get_request_provider
Return the configured request provider. If none is configured, open and return the default request queue.
Parameters
id: str | None = None (keyword-only)
name: str | None = None (keyword-only)
Returns RequestProvider
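A sketch of obtaining the request provider directly, for example to inspect or feed it outside the crawler; anything done with the returned provider beyond obtaining it would be an assumption.

```python
provider = await crawler.get_request_provider()  # default request queue if none was configured
```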
router
Parameters
router: Router[TCrawlingContext]
Returns None
run
Run the crawler until all requests are processed.
Parameters
requests: Sequence[str | Request] | None = None
purge_request_queue: bool = True (keyword-only)
Returns FinalStatistics
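A sketch of a complete run, assuming a crawler with a request handler already configured; asyncio.run drives the coroutine, and the start URL is illustrative.

```python
import asyncio

async def main() -> None:
    stats = await crawler.run(
        ['https://example.com'],   # optional initial requests (plain URLs or Request objects)
        purge_request_queue=True,  # start from a clean queue (the default)
    )
    crawler.log.info('Crawl finished: %s', stats)

asyncio.run(main())
```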
Properties
log
The logger used by the crawler.
router
The router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
Provides a simple framework for parallel crawling of web pages.
The URLs to crawl are fed either from a static list or from a dynamic queue, which enables recursive crawling of websites.
BasicCrawler is a low-level tool that requires the user to implement the page download and data extraction functionality themselves. If you want a crawler that already provides this functionality, consider using one of its subclasses. A minimal end-to-end sketch follows.
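The sketch below ties the pieces together: construct the crawler, register a handler, and run it. The import path for BasicCrawlingContext and the router.default_handler decorator are assumptions based on common Crawlee usage and the router property above, not guaranteed by this page.

```python
import asyncio

from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    crawler = BasicCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def default_handler(context: BasicCrawlingContext) -> None:
        # BasicCrawler does not download or parse pages for you; do that here,
        # e.g. with the configured HTTP client, and store whatever you extract.
        crawler.log.info('Processing %s ...', context.request.url)

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```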