
Playwright crawler

A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.

When to use Playwright crawler

Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:

  • Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
  • Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
  • Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.

If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges. See a basic example for a typical usage demonstration.

Advanced configuration

The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor.

Managing multiple browsers

The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation, which can help avoid blocks or reveal how a site behaves in different browsers.

import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a plugin for each required browser.
    plugin_chromium = PlaywrightBrowserPlugin(
        browser_type='chromium', max_open_pages_per_browser=1
    )
    plugin_firefox = PlaywrightBrowserPlugin(
        browser_type='firefox', max_open_pages_per_browser=1
    )

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[plugin_chromium, plugin_firefox]),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        browser_name = (
            context.page.context.browser.browser_type.name
            if context.page.context.browser
            else 'undefined'
        )
        context.log.info(f'Processing {context.request.url} with {browser_name} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev', 'https://apify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Browser launch and context configuration

The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=False,
        browser_type='chromium',
        # Browser launch options.
        browser_launch_options={
            # To use the `msedge` channel, you need to install it first:
            # `playwright install msedge`
            'channel': 'msedge',
            'slow_mo': 200,
        },
        # Context launch options, applied to each page as it is created.
        browser_new_context_options={
            'color_scheme': 'dark',
            # Set custom headers.
            'extra_http_headers': {
                'Custom-Header': 'my-header',
                'Accept-Language': 'en',
            },
            # Set only the User-Agent header.
            'user_agent': 'My-User-Agent',
        },
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

You can also configure each plugin used by BrowserPool:

import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(
            plugins=[
                PlaywrightBrowserPlugin(
                    browser_type='chromium',
                    browser_launch_options={
                        'headless': False,
                        'channel': 'msedge',
                        'slow_mo': 200,
                    },
                    browser_new_context_options={
                        'color_scheme': 'dark',
                        'extra_http_headers': {
                            'Custom-Header': 'my-header',
                            'Accept-Language': 'en',
                        },
                        'user_agent': 'My-User-Agent',
                    },
                )
            ]
        )
    )

    # ...


if __name__ == '__main__':
    asyncio.run(main())

For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with PlaywrightCrawler.

Page configuration with lifecycle page hooks

For additional setup or event-driven actions around page creation and closure, the BrowserPool exposes four lifecycle hooks: pre_page_create_hook, post_page_create_hook, pre_page_close_hook, and post_page_close_hook. To use them, create a BrowserPool instance and pass it to PlaywrightCrawler via the browser_pool argument.

from __future__ import annotations

import asyncio
import logging
from typing import TYPE_CHECKING, Any

from crawlee.browsers import BrowserPool
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore

if TYPE_CHECKING:
    from crawlee.browsers._browser_controller import BrowserController
    from crawlee.browsers._types import CrawleePage
    from crawlee.proxy_configuration import ProxyInfo

logger = logging.getLogger(__name__)


async def main() -> None:
    async with BrowserPool() as browser_pool:

        @browser_pool.pre_page_create_hook
        async def log_page_init(
            page_id: str,
            _browser_controller: BrowserController,
            _browser_new_context_options: dict[str, Any],
            _proxy_info: ProxyInfo | None,
        ) -> None:
            """Log when a new page is about to be created."""
            logger.info(f'Creating page {page_id}...')

        @browser_pool.post_page_create_hook
        async def set_viewport(
            crawlee_page: CrawleePage, _browser_controller: BrowserController
        ) -> None:
            """Set a fixed viewport size on each newly created page."""
            await crawlee_page.page.set_viewport_size({'width': 1280, 'height': 1024})

        @browser_pool.pre_page_close_hook
        async def save_screenshot(
            crawlee_page: CrawleePage, _browser_controller: BrowserController
        ) -> None:
            """Save a screenshot to KeyValueStore before each page is closed."""
            kvs = await KeyValueStore.open()

            screenshot = await crawlee_page.page.screenshot()
            await kvs.set_value(
                key=f'screenshot-{crawlee_page.id}',
                value=screenshot,
                content_type='image/png',
            )
            logger.info(f'Saved screenshot for page {crawlee_page.id}.')

        @browser_pool.post_page_close_hook
        async def log_page_closed(
            page_id: str, _browser_controller: BrowserController
        ) -> None:
            """Log after each page is closed."""
            logger.info(f'Page {page_id} closed successfully.')

        crawler = PlaywrightCrawler(
            browser_pool=browser_pool,
            max_requests_per_crawl=5,
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')

            await context.enqueue_links()

        # Run the crawler with the initial list of URLs.
        await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Navigation hooks allow for additional configuration at specific points during page navigation. The pre_navigation_hook is called before each navigation and provides PlaywrightPreNavCrawlingContext - including the page instance and a block_requests helper for filtering unwanted resource types and URL patterns. See the block requests example for a dedicated walkthrough. Similarly, the post_navigation_hook is called after each navigation and provides PlaywrightPostNavCrawlingContext - useful for post-load checks such as detecting CAPTCHAs or verifying page state.

import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPostNavCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)
from crawlee.errors import SessionError


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    @crawler.pre_navigation_hook
    async def configure_page(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

        # Block stylesheets, images, fonts and other static assets
        # to speed up page loading.
        await context.block_requests()

    @crawler.post_navigation_hook
    async def custom_captcha_check(context: PlaywrightPostNavCrawlingContext) -> None:
        # Check if the page contains a captcha.
        captcha_element = context.page.locator('input[name="captcha"]').first
        if await captcha_element.is_visible():
            context.log.warning('Captcha detected! Skipping the page.')
            raise SessionError('Captcha detected')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Conclusion

This guide introduced the PlaywrightCrawler and explained how to configure it using BrowserPool and PlaywrightBrowserPlugin. You learned how to launch multiple browsers, configure browser and context settings, use BrowserPool lifecycle page hooks, and apply navigation hooks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!