PlaywrightCrawler
Hierarchy
- BasicCrawler
- PlaywrightCrawler
Index
Constructors
__init__
Create a new instance.
Parameters
browser_pool: BrowserPool | None = None
A BrowserPool instance to be used for launching the browsers and getting pages.
browser_type: BrowserType | None = None
The type of browser to launch ('chromium', 'firefox', or 'webkit'). This option should not be used if browser_pool is provided.
headless: bool | None = None
Whether to run the browser in headless mode. This option should not be used if browser_pool is provided.
kwargs: Unpack[BasicCrawlerOptions[PlaywrightCrawlingContext]]
Additional arguments to be forwarded to the underlying BasicCrawler.
Returns None
Methods
add_requests
Add requests to the underlying request provider in batches.
Parameters
requests: Sequence[str | Request]
A list of requests to add to the queue.
keyword-only batch_size: int = 1000
The number of requests to add in one batch.
keyword-only wait_time_between_batches: timedelta = timedelta(0)
Time to wait between adding batches.
keyword-only wait_for_all_requests_to_be_added: bool = False
If True, wait for all requests to be added before returning.
keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None
Timeout for waiting for all requests to be added.
Returns None
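The batching semantics can be pictured with plain Python. This is an illustrative sketch of how the input is split, not the library's internals; the helper name is hypothetical:

```python
from datetime import timedelta

def split_into_batches(requests: list[str], batch_size: int = 1000) -> list[list[str]]:
    """Split the request list into consecutive chunks of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

urls = [f'https://example.com/page/{n}' for n in range(2500)]
batches = split_into_batches(urls, batch_size=1000)
# 2500 URLs with batch_size=1000 yield batches of 1000, 1000, and 500;
# the crawler sleeps wait_time_between_batches (a timedelta) between them.
delay = timedelta(seconds=1)
```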
error_handler
Decorator for configuring an error handler (called after a request handler error and before retrying).
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
export_data
Export data from a dataset.
This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.
Parameters
path: str | Path
The destination path for the exported file.
content_type: Literal['json', 'csv'] | None = None
The output format.
dataset_id: str | None = None
The ID of the dataset.
dataset_name: str | None = None
The name of the dataset.
Returns None
failed_request_handler
Decorator for configuring a failed request handler (called after max retries are reached).
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
get_data
Retrieve data from a dataset.
This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.
Parameters
dataset_id: str | None = None
The ID of the dataset.
dataset_name: str | None = None
The name of the dataset.
kwargs: Unpack[GetDataKwargs]
Keyword arguments to be passed to the dataset's get_data method.
Returns DatasetItemsListPage
The retrieved data.
get_dataset
Return the dataset with the given ID or name. If none is provided, return the default dataset.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns Dataset
get_key_value_store
Return the key-value store with the given ID or name. If none is provided, return the default KVS.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns KeyValueStore
get_request_provider
Return the configured request provider. If none is configured, open and return the default request queue.
Parameters
keyword-only id: str | None = None
keyword-only name: str | None = None
Returns RequestProvider
router
Set the router used to handle each individual crawling request.
Parameters
router: Router[TCrawlingContext]
Returns None
run
Run the crawler until all requests are processed.
Parameters
requests: Sequence[str | Request] | None = None
The requests to be enqueued before the crawler starts.
keyword-only purge_request_queue: bool = True
If this is True and the crawler is not being run for the first time, the default request queue will be purged.
Returns FinalStatistics
Properties
log
The logger used by the crawler.
router
The router used to handle each individual crawling request.
statistics
Statistics about the current (or last) crawler run.
A crawler that leverages the Playwright browser automation library.
PlaywrightCrawler is a subclass of BasicCrawler, inheriting all its features, such as autoscaling of requests, request routing, and utilization of RequestProvider. Additionally, it offers Playwright-specific methods and properties, like the page property for user data extraction and the enqueue_links method for crawling other pages.
This crawler is ideal for crawling websites that require JavaScript execution, as it uses headless browsers to download web pages and extract data. For websites that do not require JavaScript, consider using BeautifulSoupCrawler, which uses raw HTTP requests and is much faster.
PlaywrightCrawler opens a new browser page (i.e., tab) for each Request object and invokes the user-provided request handler function via the Router. Users can interact with the page and extract data using the Playwright API.
Note that the pool of browser instances used by PlaywrightCrawler, and the pages they open, is managed internally by the BrowserPool.