BeautifulSoupCrawler
Hierarchy
- BasicCrawler
- BeautifulSoupCrawler
Index
Constructors
__init__
Initialize the BeautifulSoupCrawler.
Parameters
keyword-only parser: Literal['html.parser', 'lxml', 'xml', 'html5lib'] = 'lxml'
The type of parser that BeautifulSoup should use.
keyword-only additional_http_error_status_codes: Iterable[int] = ()
HTTP status codes that should additionally be considered errors (and trigger a retry).
keyword-only ignore_http_error_status_codes: Iterable[int] = ()
HTTP status codes that are typically considered errors but should be treated as successful responses instead.
kwargs: Unpack[BasicCrawlerOptions[BeautifulSoupCrawlingContext]]
Arguments to be forwarded to the underlying BasicCrawler
Returns None
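A minimal construction sketch; the import path is an assumption and may differ across Crawlee versions, and max_requests_per_crawl stands in for any option forwarded to the underlying BasicCrawler:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler  # import path may vary by version

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        parser='html.parser',  # override the 'lxml' default
        additional_http_error_status_codes=[403],  # also retry on 403
        ignore_http_error_status_codes=[404],  # treat 404 responses as successful
        max_requests_per_crawl=10,  # forwarded to the underlying BasicCrawler
    )

asyncio.run(main())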
Methods
add_requests
Add requests to the underlying request provider in batches.
Parameters
requests: Sequence[str | Request]
A list of requests to add to the queue.
keyword-only batch_size: int = 1000
The number of requests to add in one batch.
keyword-only wait_time_between_batches: timedelta = timedelta(0)
Time to wait between adding batches.
keyword-only wait_for_all_requests_to_be_added: bool = False
If True, wait for all requests to be added before returning.
keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Timeout for waiting for all requests to be added.
Returns None
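A sketch of batched enqueueing, assuming a crawler built as in the constructor example above (the URLs are illustrative):

await crawler.add_requests(
    ['https://crawlee.dev', 'https://apify.com'],  # plain URLs or Request objects
    batch_size=500,
    wait_for_all_requests_to_be_added=True,  # block until the whole list is queued
)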
error_handler
Decorator for configuring an error handler (called after a request handler error and before retrying).
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
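A sketch of registering an error handler; the (context, error) signature is assumed from the ErrorHandler type, and the import path may vary by version:

from crawlee.basic_crawler import BasicCrawlingContext  # import path may vary by version

@crawler.error_handler
async def retry_handler(context: BasicCrawlingContext, error: Exception) -> None:
    # Called after the request handler fails, before the request is retried.
    crawler.log.warning(f'Retrying {context.request.url} after: {error}')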
export_data
Export data from a dataset.
This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.
Parameters
path: str | Path
The destination path
content_type: Literal['json', 'csv'] | None = None
The output format
dataset_id: str | None = None
The ID of the dataset.
dataset_name: str | None = None
The name of the dataset.
Returns None
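A sketch of exporting the default dataset after a run finishes (the file names are illustrative):

await crawler.export_data('results.json', content_type='json')  # JSON export
await crawler.export_data('results.csv', content_type='csv')  # CSV export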
failed_request_handler
Decorator for configuring a failed request handler (called after max retries are reached).
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
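A sketch mirroring the error handler above, but invoked only once a request has exhausted its retries (the signature is assumed from the FailedRequestHandler type):

@crawler.failed_request_handler
async def failed_handler(context: BasicCrawlingContext, error: Exception) -> None:
    # Called once, after the request has used up all of its retries.
    crawler.log.error(f'Giving up on {context.request.url}: {error}')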
get_data
Retrieve data from a dataset.
This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.
Parameters
dataset_id: str | None = None
The ID of the dataset.
dataset_name: str | None = None
The name of the dataset.
kwargs: Unpack[GetDataKwargs]
Keyword arguments to be passed to the dataset's get_data method.
Returns DatasetItemsListPage
The retrieved data.
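A sketch of reading the default dataset; treating limit as one of the supported GetDataKwargs is an assumption:

page = await crawler.get_data(limit=10)  # default dataset; limit assumed supported
for item in page.items:
    print(item)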
get_dataset
Return the dataset with the given ID or name. If none is provided, return the default dataset.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns Dataset
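A sketch of opening the default dataset and writing to it directly (the item is illustrative):

dataset = await crawler.get_dataset()  # no id/name: the default dataset
await dataset.push_data({'url': 'https://example.com'})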
get_key_value_store
Return the key-value store with the given ID or name. If none is provided, return the default KVS.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns KeyValueStore
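A sketch of reading and writing the default key-value store (the key and value are illustrative):

kvs = await crawler.get_key_value_store()  # no id/name: the default KVS
await kvs.set_value('last-run', {'status': 'ok'})
value = await kvs.get_value('last-run')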
get_request_provider
Return the configured request provider. If none is configured, open and return the default request queue.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns RequestProvider
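A sketch of obtaining the provider and enqueueing through it; that add_request is available on the returned provider is an assumption:

provider = await crawler.get_request_provider()  # default request queue if none configured
await provider.add_request('https://crawlee.dev')  # add_request assumed available here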
router
Parameters
router: Router[TCrawlingContext]
Returns None
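A sketch of building a Router separately and attaching it to the crawler; the two import paths are assumptions that may vary by version:

from crawlee.router import Router  # import path may vary by version
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext

router = Router[BeautifulSoupCrawlingContext]()

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    title = context.soup.title.string if context.soup.title else None
    await context.push_data({'url': context.request.url, 'title': title})

crawler.router = router  # attach before starting the run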
run
Run the crawler until all requests are processed.
Parameters
requests: Sequence[str | Request] | None = None
The requests to be enqueued before the crawler starts
keyword-only purge_request_queue: bool = True
If True and the crawler is not being run for the first time, the default request queue will be purged.
Returns FinalStatistics
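An end-to-end sketch tying the pieces together (the handler body and start URL are illustrative):

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.push_data({'url': context.request.url})
    await context.enqueue_links()  # follow links found on the page

stats = await crawler.run(['https://crawlee.dev'])
crawler.log.info(f'Crawl finished: {stats}')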
Properties
log
The logger used by the crawler.
router
The router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
A crawler that fetches the request URL using httpx and parses the result with BeautifulSoup.