Version: Next

StagehandCrawler

A web crawler that integrates Stagehand AI-powered browser automation with Crawlee.

StagehandCrawler builds on top of PlaywrightCrawler, inheriting all of its features. It uses StagehandBrowserPlugin to manage Stagehand sessions. Stagehand creates and manages the browser instance - either locally via a bundled Chromium binary, or remotely via Browserbase cloud - and Playwright connects to it via the Chrome DevTools Protocol (CDP).

Because Stagehand relies on CDP, only Chromium is supported. Not all Playwright browser and context configuration options are available - browser settings are limited to the subset accepted by Stagehand's BrowserLaunchOptions (such as headless, args, viewport, proxy, locale, and executable_path). Full browser fingerprinting (canvas, WebGL, screen properties) and incognito pages are not supported; fingerprint-consistent HTTP headers (User-Agent, Accept, sec-ch-ua) are still injected automatically.

Each page in the crawling context is a StagehandPage, which extends the standard Playwright Page with the following AI methods:

  • page.act(**kwargs) - perform an action on the page using natural language
  • page.extract(**kwargs) - extract structured data from the page with AI
  • page.observe(**kwargs) - get AI-suggested actions available on the page
  • page.execute(**kwargs) - run an autonomous multi-step agent

Stagehand configuration (model, API key, environment) is provided via stagehand_options. By default, the crawler runs locally using the openai/gpt-5.4-nano model.

Usage

import asyncio
from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext
from crawlee.browsers import StagehandOptions

crawler = StagehandCrawler(
    stagehand_options=StagehandOptions(
        model_api_key='sk-...',
        model='openai/gpt-5.4-nano',
    ),
)

@crawler.router.default_handler
async def handler(context: StagehandCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url} ...')

    # Use standard Playwright methods alongside AI methods.
    await context.page.act(input='Click the accept cookies button if present')

    data = await context.page.extract(instruction='Get the article title and author')

    await context.push_data(data)

asyncio.run(crawler.run(['https://example.com']))
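
The observe helper can be used inside the same handler to let the AI list candidate actions before acting on them. A brief sketch, assuming observe accepts an instruction keyword argument like the other AI methods and returns a sequence of suggested actions:

# Inside the request handler above:
suggestions = await context.page.observe(instruction='Find links to article detail pages')
for suggestion in suggestions:
    context.log.info(f'Suggested action: {suggestion}')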

Methods

__init__

  • __init__(*, stagehand_options, browser_pool, user_data_dir, headless, browser_launch_options, browser_new_context_options, goto_options, navigation_timeout, request_handler, statistics, configuration, event_manager, storage_client, request_manager, session_pool, proxy_configuration, http_client, max_request_retries, max_requests_per_crawl, max_session_rotations, max_crawl_depth, use_session_pool, retry_on_blocked, concurrency_settings, request_handler_timeout, abort_on_error, configure_logging, statistics_log_format, keep_alive, additional_http_error_status_codes, ignore_http_error_status_codes, respect_robots_txt_file, status_message_logging_interval, status_message_callback, id): None
  • Initialize a new instance.


    Parameters

    • optional, keyword-only stagehand_options: StagehandOptions | None = None

      Stagehand-specific configuration (model, API key, env, etc.). Cannot be specified if browser_pool is provided.

    • optional, keyword-only browser_pool: BrowserPool | None = None

      A pre-configured BrowserPool. All plugins must be instances of StagehandBrowserPlugin. If omitted, a pool is created automatically from the other browser arguments.

    • optional, keyword-only user_data_dir: (str | Path) | None = None

      Path to a user data directory, which stores browser session data like cookies and local storage. Cannot be specified if browser_pool is provided.

    • optional, keyword-only headless: bool | None = None

      Whether to run the browser in headless mode. Defaults to the value from Crawlee's global Configuration. Cannot be specified if browser_pool is provided.

    • optional, keyword-only browser_launch_options: dict[str, Any] | None = None

      Keyword arguments for browser launch passed to Stagehand's BrowserLaunchOptions (a subset of Playwright's launch options). Supported keys include args, executable_path, proxy, viewport, locale, and others. Cannot be specified if browser_pool is provided. A construction sketch using these options follows this parameter list.

    • optional, keyword-only browser_new_context_options: dict[str, Any] | None = None

      Keyword arguments for browser context creation, merged with browser_launch_options. Options that map to BrowserLaunchOptions take effect on the first page; subsequent pages reuse the existing session context. Cannot be specified if browser_pool is provided.

    • optional, keyword-only goto_options: GotoOptions | None = None

      Additional options passed to Stagehand's Page.goto(). The timeout option is not supported - use navigation_timeout instead.

    • optional, keyword-only navigation_timeout: timedelta | None = None

      Timeout for the navigation phase (from opening the page to calling the request handler). Defaults to one minute.

    • optional, keyword-only request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]]

      A callable responsible for handling requests.

    • optional, keyword-only statistics: NotRequired[Statistics[TStatisticsState]]

      A custom Statistics instance, allowing the use of non-default configuration.

    • optional, keyword-only configuration: NotRequired[Configuration]

      The Configuration instance. Some of its properties are used as defaults for the crawler.

    • optional, keyword-only event_manager: NotRequired[EventManager]

      The event manager for managing events for the crawler and all its components.

    • optional, keyword-only storage_client: NotRequired[StorageClient]

      The storage client for managing storages for the crawler and all its components.

    • optional, keyword-only request_manager: NotRequired[RequestManager]

      Manager of requests that should be processed by the crawler.

    • optional, keyword-only session_pool: NotRequired[SessionPool]

      A custom SessionPool instance, allowing the use of non-default configuration.

    • optional, keyword-only proxy_configuration: NotRequired[ProxyConfiguration]

      HTTP proxy configuration used when making requests.

    • optional, keyword-only http_client: NotRequired[HttpClient]

      HTTP client used by BasicCrawlingContext.send_request method.

    • optional, keyword-only max_request_retries: NotRequired[int]

      Specifies the maximum number of retries allowed for a request if its processing fails. This includes retries due to navigation errors or errors thrown from user-supplied functions (request_handler, pre_navigation_hooks etc.).

      This limit does not apply to retries triggered by session rotation (see max_session_rotations).

    • optional, keyword-only max_requests_per_crawl: NotRequired[int | None]

      Maximum number of pages to open during a crawl. The crawl stops upon reaching this limit. Setting this value can help avoid infinite loops in misconfigured crawlers. None means no limit. Due to concurrency settings, the actual number of pages visited may slightly exceed this value.

    • optional, keyword-only max_session_rotations: NotRequired[int]

      Maximum number of session rotations per request. The crawler rotates the session if a proxy error occurs or if the website blocks the request.

      The session rotations are not counted towards the max_request_retries limit.

    • optional, keyword-only max_crawl_depth: NotRequired[int | None]

      Specifies the maximum crawl depth. If set, the crawler will stop processing links beyond this depth. The crawl depth starts at 0 for initial requests and increases with each subsequent level of links. Requests at the maximum depth will still be processed, but no new links will be enqueued from those requests. If not set, crawling continues without depth restrictions.

    • optional, keyword-only use_session_pool: NotRequired[bool]

      Enable the use of a session pool for managing sessions during crawling.

    • optional, keyword-only retry_on_blocked: NotRequired[bool]

      If True, the crawler attempts to bypass bot protections automatically.

    • optional, keyword-only concurrency_settings: NotRequired[ConcurrencySettings]

      Settings to fine-tune concurrency levels.

    • optional, keyword-only request_handler_timeout: NotRequired[timedelta]

      Maximum duration allowed for a single request handler to run.

    • optional, keyword-only abort_on_error: NotRequired[bool]

      If True, the crawler stops immediately when any request handler error occurs.

    • optional, keyword-only configure_logging: NotRequired[bool]

      If True, the crawler will set up logging infrastructure automatically.

    • optional, keyword-only statistics_log_format: NotRequired[Literal['table', 'inline']]

      If 'table', displays crawler statistics as formatted tables in logs. If 'inline', outputs statistics as plain text log messages.

    • optional, keyword-only keep_alive: NotRequired[bool]

      If True, the crawler keeps running even when there are no requests in the queue.

    • optional, keyword-only additional_http_error_status_codes: NotRequired[Iterable[int]]

      Additional HTTP status codes to treat as errors, triggering automatic retries when encountered.

    • optional, keyword-only ignore_http_error_status_codes: NotRequired[Iterable[int]]

      HTTP status codes that are typically considered errors but should be treated as successful responses.

    • optional, keyword-only respect_robots_txt_file: NotRequired[bool]

      If set to True, the crawler will automatically try to fetch the robots.txt file for each domain and skip requests that it disallows. This also prevents disallowed URLs from being added via EnqueueLinksFunction.

    • optional, keyword-only status_message_logging_interval: NotRequired[timedelta]

      Interval for logging the crawler status messages.

    • optional, keyword-only status_message_callback: NotRequired[Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]]

      Allows overriding the default status message. The default status message is provided in the parameters. Returning None suppresses the status message.

    • optional, keyword-only id: NotRequired[int]

      Identifier used for crawler state tracking. Use the same id across multiple crawlers to share state between them.

    Returns None
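
    A construction sketch combining several of the parameters above. The stagehand_options values are placeholders, and the browser_launch_options keys follow the subset described for Stagehand's BrowserLaunchOptions; adjust them to your setup:

    from datetime import timedelta

    from crawlee.browsers import StagehandOptions
    from crawlee.crawlers import StagehandCrawler

    crawler = StagehandCrawler(
        stagehand_options=StagehandOptions(
            model_api_key='sk-...',
            model='openai/gpt-5.4-nano',
        ),
        headless=True,
        browser_launch_options={
            # Illustrative values; only the BrowserLaunchOptions subset is accepted.
            'args': ['--disable-gpu'],
            'viewport': {'width': 1280, 'height': 720},
            'locale': 'en-US',
        },
        navigation_timeout=timedelta(seconds=30),
        max_requests_per_crawl=50,
    )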

add_requests

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the underlying request manager in batches.


    Parameters

    • requests: Sequence[str | Request]

      A list of requests to add to the queue.

    • optional, keyword-only forefront: bool = False

      If True, add requests to the forefront of the queue.

    • optional, keyword-only batch_size: int = 1000

      The number of requests to add in one batch.

    • optional, keyword-only wait_time_between_batches: timedelta = timedelta(0)

      Time to wait between adding batches.

    • optional, keyword-only wait_for_all_requests_to_be_added: bool = False

      If True, wait for all requests to be added before returning.

    • optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

      Timeout for waiting for all requests to be added.

    Returns None
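
    A usage sketch. Requests can be plain URLs or Request objects; the label value is illustrative:

    from crawlee import Request

    await crawler.add_requests(
        [
            'https://crawlee.dev',
            Request.from_url('https://example.com/detail', label='DETAIL'),
        ],
        forefront=True,  # process these before requests already in the queue
    )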

error_handler

  • error_handler(handler): ErrorHandler[TCrawlingContext]
  • Register a function to handle errors occurring in request handlers.

    The error handler is invoked after a request handler error occurs and before a retry attempt.


    Parameters

    • handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]

    Returns ErrorHandler[TCrawlingContext]
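
    A registration sketch, assuming the handler is invoked with the crawling context and the raised exception:

    @crawler.error_handler
    async def handle_error(context, error: Exception) -> None:
        # Log the failure; the request is retried afterwards if attempts remain.
        context.log.warning(f'Error while processing {context.request.url}: {error}')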

export_data

  • async export_data(path, dataset_id, dataset_name, dataset_alias, additional_kwargs): None
  • Export all items from a Dataset to a JSON or CSV file.

    This method simplifies the process of exporting data collected during crawling. It automatically determines the export format based on the file extension (.json or .csv) and handles the conversion of Dataset items to the appropriate format.


    Parameters

    • path: str | Path

      The destination file path. Must end with '.json' or '.csv'.

    • optional dataset_id: str | None = None

      The ID of the Dataset to export from.

    • optional dataset_name: str | None = None

      The name of the Dataset to export from (global scope, named storage).

    • optional dataset_alias: str | None = None

      The alias of the Dataset to export from (run scope, unnamed storage).

    • additional_kwargs: Unpack[ExportDataKwargs]

      Extra keyword arguments forwarded to the JSON/CSV exporter depending on the file format.

    Returns None
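
    Typical use after a finished run; the file names are illustrative:

    await crawler.run(['https://example.com'])

    # The export format is inferred from the file extension.
    await crawler.export_data('results.json')
    await crawler.export_data('results.csv')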

failed_request_handler

  • failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]
  • Register a function to handle requests that exceed the maximum retry limit.

    The failed request handler is invoked when a request has failed all retry attempts.


    Parameters

    • handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]

    Returns FailedRequestHandler[TCrawlingContext]
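
    A registration sketch, assuming the handler is invoked with the crawling context and the final exception:

    @crawler.failed_request_handler
    async def handle_failed(context, error: Exception) -> None:
        # Called once per request after all retry attempts have been exhausted.
        context.log.error(f'Giving up on {context.request.url}: {error}')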

get_data

  • async get_data(dataset_id, dataset_name, dataset_alias, kwargs): DatasetItemsListPage
  • Retrieve data from a Dataset.

    This helper method simplifies the process of retrieving data from a Dataset. It opens the specified Dataset and then retrieves the data based on the provided parameters.


    Parameters

    • optional dataset_id: str | None = None

      The ID of the Dataset.

    • optional dataset_name: str | None = None

      The name of the Dataset (global scope, named storage).

    • optional dataset_alias: str | None = None

      The alias of the Dataset (run scope, unnamed storage).

    • kwargs: Unpack[GetDataKwargs]

      Keyword arguments to be passed to the Dataset.get_data() method.

    Returns DatasetItemsListPage

    The retrieved data.
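
    A retrieval sketch, assuming the returned DatasetItemsListPage exposes the fetched items via its items attribute:

    data = await crawler.get_data()
    for item in data.items:
        print(item)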

get_dataset

  • async get_dataset(*, id, name, alias): Dataset
  • Return the Dataset with the given ID, name, or alias. If none is provided, return the default one.


    Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None
    • optional, keyword-only alias: str | None = None

    Returns Dataset

get_key_value_store

  • async get_key_value_store(*, id, name, alias): KeyValueStore
  • Return the KeyValueStore with the given ID, name, or alias. If none is provided, return the default key-value store.


    Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None
    • optional, keyword-only alias: str | None = None

    Returns KeyValueStore

get_request_manager

on_skipped_request

post_navigation_hook

  • post_navigation_hook(hook): None
  • Register a hook to be called after each navigation.


    Parameters

    • hook: Callable[[TPostNavContext], Awaitable[None]]

      A coroutine function to be called after each navigation.

    Returns None

pre_navigation_hook

  • pre_navigation_hook(hook): None
  • Register a hook to be called before each navigation.


    Parameters

    • hook: Callable[[TPreNavContext], Awaitable[None]]

      A coroutine function to be called before each navigation.

    Returns None
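
    A registration sketch, assuming the pre-navigation context exposes the Playwright page object as it does for PlaywrightCrawler:

    @crawler.pre_navigation_hook
    async def before_navigation(context) -> None:
        # Runs after the page is created but before navigation starts.
        context.page.set_default_timeout(10_000)
        context.log.info(f'About to navigate to {context.request.url}')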

run

  • async run(requests, *, purge_request_queue): FinalStatistics
  • Run the crawler until all requests are processed.


    Parameters

    • optional requests: Sequence[str | Request] | None = None

      The requests to be enqueued before the crawler starts.

    • optional, keyword-only purge_request_queue: bool = True

      If this is True and the crawler is not being run for the first time, the default request queue will be purged.

    Returns FinalStatistics
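
    A brief sketch of a repeated run; the returned FinalStatistics can be logged or inspected after the call:

    stats = await crawler.run(['https://example.com'])
    print(stats)

    # On a later run, keep the default request queue instead of purging it.
    await crawler.run(purge_request_queue=False)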

stop

  • stop(reason): None
  • Set a flag to stop the crawler.

    This stops the current crawler run regardless of whether all requests have been processed.


    Parameters

    • optional reason: str = 'Stop was called externally.'

      Reason for stopping that will be used in logs.

    Returns None
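
    A brief sketch of stopping the crawl early from inside a request handler; the stop condition is illustrative:

    if context.request.url.endswith('/last-page'):
        crawler.stop(reason='Reached the last page of results.')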

use_state

  • async use_state(default_value): dict[str, JsonSerializable]
  • Parameters

    • optional default_value: dict[str, JsonSerializable] | None = None

    Returns dict[str, JsonSerializable]
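
    A usage sketch, assuming the returned dictionary is persisted by the crawler so that changes are saved automatically:

    state = await crawler.use_state({'pages_processed': 0})
    state['pages_processed'] += 1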

Properties

log

log: logging.Logger

The logger used by the crawler.

router

router: Router[TCrawlingContext]

The Router used to handle each individual crawling request.

statistics

statistics: Statistics[TStatisticsState]

Statistics about the current (or last) crawler run.