
Playwright crawler

A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.

When to use Playwright crawler

Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:

  • Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
  • Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
  • Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.

If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges. See a basic example for a typical usage demonstration.

Advanced configuration

The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor.

Managing multiple browsers

The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation, which can help avoid blocks or reveal how a site behaves in different browsers.

import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a plugin for each required browser.
    plugin_chromium = PlaywrightBrowserPlugin(
        browser_type='chromium', max_open_pages_per_browser=1
    )
    plugin_firefox = PlaywrightBrowserPlugin(
        browser_type='firefox', max_open_pages_per_browser=1
    )

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[plugin_chromium, plugin_firefox]),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        browser_name = (
            context.page.context.browser.browser_type.name
            if context.page.context.browser
            else 'undefined'
        )
        context.log.info(f'Processing {context.request.url} with {browser_name} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev', 'https://apify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Browser launch and context configuration

The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=False,
        browser_type='chromium',
        # Browser launch options.
        browser_launch_options={
            # To use the `msedge` channel, you need to install it first:
            # `playwright install msedge`
            'channel': 'msedge',
            'slow_mo': 200,
        },
        # Context launch options, applied to each page as it is created.
        browser_new_context_options={
            'color_scheme': 'dark',
            # Set custom headers.
            'extra_http_headers': {
                'Custom-Header': 'my-header',
                'Accept-Language': 'en',
            },
            # Set only the User-Agent header.
            'user_agent': 'My-User-Agent',
        },
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

You can also configure each plugin used by BrowserPool:

import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(
            plugins=[
                PlaywrightBrowserPlugin(
                    browser_type='chromium',
                    browser_launch_options={
                        'headless': False,
                        'channel': 'msedge',
                        'slow_mo': 200,
                    },
                    browser_new_context_options={
                        'color_scheme': 'dark',
                        'extra_http_headers': {
                            'Custom-Header': 'my-header',
                            'Accept-Language': 'en',
                        },
                        'user_agent': 'My-User-Agent',
                    },
                )
            ]
        )
    )

    # ...


if __name__ == '__main__':
    asyncio.run(main())

For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with PlaywrightCrawler.

Page configuration with lifecycle page hooks

For additional setup or event-driven actions around page creation and closure, the BrowserPool exposes four lifecycle hooks: pre_page_create_hook, post_page_create_hook, pre_page_close_hook, and post_page_close_hook. To use them, create a BrowserPool instance and pass it to PlaywrightCrawler via the browser_pool argument.

from __future__ import annotations

import asyncio
import logging
from typing import TYPE_CHECKING, Any

from crawlee.browsers import BrowserPool
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore

if TYPE_CHECKING:
    from crawlee.browsers._browser_controller import BrowserController
    from crawlee.browsers._types import CrawleePage
    from crawlee.proxy_configuration import ProxyInfo

logger = logging.getLogger(__name__)


async def main() -> None:
    async with BrowserPool() as browser_pool:

        @browser_pool.pre_page_create_hook
        async def log_page_init(
            page_id: str,
            _browser_controller: BrowserController,
            _browser_new_context_options: dict[str, Any],
            _proxy_info: ProxyInfo | None,
        ) -> None:
            """Log when a new page is about to be created."""
            logger.info(f'Creating page {page_id}...')

        @browser_pool.post_page_create_hook
        async def set_viewport(
            crawlee_page: CrawleePage, _browser_controller: BrowserController
        ) -> None:
            """Set a fixed viewport size on each newly created page."""
            await crawlee_page.page.set_viewport_size({'width': 1280, 'height': 1024})

        @browser_pool.pre_page_close_hook
        async def save_screenshot(
            crawlee_page: CrawleePage, _browser_controller: BrowserController
        ) -> None:
            """Save a screenshot to KeyValueStore before each page is closed."""
            kvs = await KeyValueStore.open()

            screenshot = await crawlee_page.page.screenshot()
            await kvs.set_value(
                key=f'screenshot-{crawlee_page.id}',
                value=screenshot,
                content_type='image/png',
            )
            logger.info(f'Saved screenshot for page {crawlee_page.id}.')

        @browser_pool.post_page_close_hook
        async def log_page_closed(
            page_id: str, _browser_controller: BrowserController
        ) -> None:
            """Log after each page is closed."""
            logger.info(f'Page {page_id} closed successfully.')

        crawler = PlaywrightCrawler(
            browser_pool=browser_pool,
            max_requests_per_crawl=5,
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')

            await context.enqueue_links()

        # Run the crawler with the initial list of URLs.
        await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Navigation hooks allow for additional configuration at specific points during page navigation. The pre_navigation_hook is called before each navigation and provides PlaywrightPreNavCrawlingContext - including the page instance and a block_requests helper for filtering unwanted resource types and URL patterns. See the block requests example for a dedicated walkthrough. Similarly, the post_navigation_hook is called after each navigation and provides PlaywrightPostNavCrawlingContext - useful for post-load checks such as detecting CAPTCHAs or verifying page state.

import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPostNavCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)
from crawlee.errors import SessionError


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    @crawler.pre_navigation_hook
    async def configure_page(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

        # Block stylesheets, images, fonts and other static assets
        # to speed up page loading.
        await context.block_requests()

    @crawler.post_navigation_hook
    async def custom_captcha_check(context: PlaywrightPostNavCrawlingContext) -> None:
        # Check if the page contains a captcha.
        captcha_element = context.page.locator('input[name="captcha"]').first
        if await captcha_element.is_visible():
            context.log.warning('Captcha detected! Skipping the page.')
            raise SessionError('Captcha detected')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

This guide introduced the PlaywrightCrawler and explained how to configure it using BrowserPool and PlaywrightBrowserPlugin. You learned how to launch multiple browsers, configure browser and context settings, use BrowserPool lifecycle page hooks, and apply navigation hooks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!