
PlaywrightCrawler

A crawler that leverages the Playwright browser automation library.

PlaywrightCrawler is a subclass of BasicCrawler, inheriting all its features, such as autoscaling of requests, request routing, and utilization of RequestProvider. Additionally, it exposes Playwright-specific methods and properties, such as the page property for data extraction and the enqueue_links method for crawling other pages.

This crawler is ideal for crawling websites that require JavaScript execution, as it uses headless browsers to download web pages and extract data. For websites that do not require JavaScript, consider using BeautifulSoupCrawler, which uses raw HTTP requests and is much faster.

PlaywrightCrawler opens a new browser page (i.e., tab) for each Request object and invokes the user-provided request handler function via the Router. Users can interact with the page and extract data using the Playwright API.

Note that the pool of browser instances used by PlaywrightCrawler, and the pages they open, is internally managed by the BrowserPool.
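
As a rough end-to-end sketch (the crawlee.playwright_crawler import path, the handler name, and the target URL are assumptions, not taken from this page):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext  # assumed import path


async def main() -> None:
    # browser_type and headless configure the internally managed BrowserPool.
    crawler = PlaywrightCrawler(browser_type='chromium', headless=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The Playwright page object is available for data extraction.
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        # Discover and enqueue links found on the page.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```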


Constructors

__init__

  • __init__(browser_pool, browser_type, headless, kwargs): None
  • Create a new instance.


    Parameters

    • browser_pool: BrowserPool | None = None

      A BrowserPool instance to be used for launching the browsers and getting pages.

    • browser_type: BrowserType | None = None

      The type of browser to launch ('chromium', 'firefox', or 'webkit'). This option should not be used if browser_pool is provided.

    • headless: bool | None = None

      Whether to run the browser in headless mode. This option should not be used if browser_pool is provided.

    • kwargs: Unpack[BasicCrawlerOptions[PlaywrightCrawlingContext]]

      Additional arguments to be forwarded to the underlying BasicCrawler.

    Returns None
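
A minimal construction sketch; the option values are illustrative, and max_requests_per_crawl stands in for any BasicCrawler option forwarded through kwargs:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def build_crawler() -> PlaywrightCrawler:
    # browser_type and headless configure the internally managed BrowserPool;
    # max_requests_per_crawl is forwarded to the underlying BasicCrawler via kwargs.
    return PlaywrightCrawler(
        browser_type='firefox',
        headless=True,
        max_requests_per_crawl=50,
    )
```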

Methods

add_requests

  • async add_requests(requests, *, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request provider in batches.


    Parameters

    • requests: Sequence[str | Request]

      A list of requests to add to the queue.

    • keyword-only batch_size: int = 1000

      The number of requests to add in one batch.

    • keyword-only wait_time_between_batches: timedelta = timedelta(0)

      Time to wait between adding batches.

    • keyword-only wait_for_all_requests_to_be_added: bool = False

      If True, wait for all requests to be added before returning.

    • keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

      Timeout for waiting for all requests to be added.

    Returns None
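
A hedged sketch of batched enqueueing; the URLs, batch size, and delay are arbitrary, and the helper function is hypothetical:

```python
from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def enqueue_seed_urls(crawler: PlaywrightCrawler) -> None:
    urls = [f'https://example.com/page/{i}' for i in range(2500)]
    # Add the URLs in batches of 500, pausing between batches, and only
    # return once every request has been handed to the request provider.
    await crawler.add_requests(
        urls,
        batch_size=500,
        wait_time_between_batches=timedelta(seconds=1),
        wait_for_all_requests_to_be_added=True,
    )
```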

error_handler

  • error_handler(handler): ErrorHandler[TCrawlingContext]
  • Decorator for configuring an error handler (called after a request handler error and before retrying).


    Parameters

    • handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]

    Returns ErrorHandler[TCrawlingContext]
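
A sketch of registering an error handler through this decorator; the handler name and log message are illustrative, and per the signature the handler may also receive a plain BasicCrawlingContext:

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext  # assumed import path


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.error_handler
    async def retry_logger(context: PlaywrightCrawlingContext, error: Exception) -> None:
        # Runs after a request handler error and before the request is retried.
        context.log.warning(f'Retrying {context.request.url} after error: {error}')

    # ... register a request handler and run the crawler as usual.
```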

export_data

  • async export_data(path, content_type, dataset_id, dataset_name): None
  • Export data from a dataset.

    This helper method simplifies the process of exporting data from a dataset. It opens the specified dataset and then exports the data based on the provided parameters.


    Parameters

    • path: str | Path

      The destination path.

    • content_type: Literal['json', 'csv'] | None = None

      The output format.

    • dataset_id: str | None = None

      The ID of the dataset.

    • dataset_name: str | None = None

      The name of the dataset.

    Returns None
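
A small sketch of exporting the default dataset; the file name is arbitrary and the wrapper function is hypothetical:

```python
from pathlib import Path

from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def export_results(crawler: PlaywrightCrawler) -> None:
    # Write the default dataset to a CSV file; pass dataset_id or
    # dataset_name to export a different dataset instead.
    await crawler.export_data(Path('results.csv'), content_type='csv')
```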

failed_request_handler

  • failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]
  • Decorator for configuring a failed request handler (called after max retries are reached).


    Parameters

    • handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]

    Returns FailedRequestHandler[TCrawlingContext]
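
A sketch of a failed-request handler registered through this decorator; names and the log message are illustrative:

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext  # assumed import path


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.failed_request_handler
    async def log_failed_request(context: PlaywrightCrawlingContext, error: Exception) -> None:
        # Runs once a request has exhausted all of its retries.
        context.log.error(f'Giving up on {context.request.url}: {error}')

    # ... register a request handler and run the crawler as usual.
```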

get_data

  • async get_data(dataset_id, dataset_name, kwargs): DatasetItemsListPage
  • Retrieve data from a dataset.

    This helper method simplifies the process of retrieving data from a dataset. It opens the specified dataset and then retrieves the data based on the provided parameters.


    Parameters

    • dataset_id: str | None = None

      The ID of the dataset.

    • dataset_name: str | None = None

      The name of the dataset.

    • kwargs: Unpack[GetDataKwargs]

      Keyword arguments to be passed to the dataset's get_data method.

    Returns DatasetItemsListPage

    The retrieved data.
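
A short sketch of reading items back from the default dataset; the wrapper function is hypothetical, and the items attribute is assumed to hold the returned records of the DatasetItemsListPage:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def print_results(crawler: PlaywrightCrawler) -> None:
    # With no arguments, the default dataset is read.
    data = await crawler.get_data()
    for item in data.items:
        print(item)
```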

get_dataset

  • async get_dataset(*, id, name): Dataset
  • Return the dataset with the given ID or name. If none is provided, return the default dataset.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns Dataset
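
A sketch of opening a named dataset and writing to it directly; the dataset name is arbitrary, and push_data is assumed to be available on the returned Dataset:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def store_extra_item(crawler: PlaywrightCrawler) -> None:
    # Open (or create) a named dataset and push an item to it directly.
    dataset = await crawler.get_dataset(name='my-results')
    await dataset.push_data({'note': 'stored outside a request handler'})
```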

get_key_value_store

  • async get_key_value_store(*, id, name): KeyValueStore
  • Return the key-value store with the given ID or name. If none is provided, return the default key-value store.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns KeyValueStore
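
A sketch of persisting a value in the default key-value store; the key and value are arbitrary, and the wrapper function is hypothetical:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def remember_state(crawler: PlaywrightCrawler) -> None:
    # Open the default key-value store and persist an arbitrary value.
    kvs = await crawler.get_key_value_store()
    await kvs.set_value('last-run', {'status': 'ok'})
    print(await kvs.get_value('last-run'))
```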

get_request_provider

  • async get_request_provider(*, id, name): RequestProvider
  • Return the configured request provider. If none is configured, open and return the default request queue.


    Parameters

    • keyword-only id: str | None = None
    • keyword-only name: str | None = None

    Returns RequestProvider
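
A sketch of inspecting the default request queue through the provider; the wrapper function is hypothetical, and is_empty is assumed to be part of the provider interface, as it is on RequestQueue:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def inspect_queue(crawler: PlaywrightCrawler) -> None:
    # Opens and returns the default request queue when no provider was configured.
    provider = await crawler.get_request_provider()
    print(await provider.is_empty())  # is_empty is assumed here, as on RequestQueue
```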

router

  • router(router): None
  • Set the router used to handle each individual crawling request.


    Parameters

    • router: Router[TCrawlingContext]

    Returns None
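
A sketch of building a Router up front and attaching it with this setter; the crawlee.router import path is an assumption, and the assignment is expected to work only while the crawler has no router configured yet:

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext  # assumed import path
from crawlee.router import Router  # assumed import path


async def main() -> None:
    router = Router[PlaywrightCrawlingContext]()

    @router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    crawler = PlaywrightCrawler()
    crawler.router = router  # attach the pre-built router before any handlers are registered on the crawler
```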

run

  • async run(requests, *, purge_request_queue): FinalStatistics
  • Run the crawler until all requests are processed.


    Parameters

    • requests: Sequence[str | Request] | None = None

      The requests to be enqueued before the crawler starts.

    • keyword-only purge_request_queue: bool = True

      If this is True and the crawler is not being run for the first time, the default request queue will be purged.

    Returns FinalStatistics
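
A sketch of starting a run with seed requests while keeping a previously populated queue; the URLs are illustrative and the wrapper function is hypothetical:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def run_crawl(crawler: PlaywrightCrawler) -> None:
    # Seed the crawl and keep any previously enqueued requests by
    # disabling the default purge of the request queue.
    statistics = await crawler.run(
        ['https://crawlee.dev', 'https://crawlee.dev/python'],
        purge_request_queue=False,
    )
    print(statistics)  # FinalStatistics summary of the finished run
```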

Properties

log

log: logging.Logger

The logger used by the crawler.

router

router: Router[TCrawlingContext]

The router used to handle each individual crawling request.

statistics

statistics: Statistics[StatisticsState]

Statistics about the current (or last) crawler run.
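
A tiny sketch touching the log and statistics properties; the log message is illustrative:

```python
from crawlee.playwright_crawler import PlaywrightCrawler  # assumed import path


async def main() -> None:
    crawler = PlaywrightCrawler()
    crawler.log.info('The log property is a standard logging.Logger.')
    print(crawler.statistics)  # Statistics for the current (or last) run
```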