Playwright crawler

A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.

When to use Playwright crawler

Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:

  • Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
  • Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
  • Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.

If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges, as the basic example below demonstrates.
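
As a quick orientation, here is a minimal sketch of a typical crawler: it opens each page in a real browser, extracts the page title (an arbitrary choice for illustration), and enqueues the links it finds.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The browser has fully rendered the page, so client-side content is available.
        title = await context.page.title()
        context.log.info(f'Processing {context.request.url} (title: {title}) ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())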

Advanced configuration

The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor, as the following sections demonstrate.

Managing multiple browsers

The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation, which helps avoid blocks and reveals how a site behaves in different browsers.

import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a plugin for each required browser.
    plugin_chromium = PlaywrightBrowserPlugin(browser_type='chromium', max_open_pages_per_browser=1)
    plugin_firefox = PlaywrightBrowserPlugin(browser_type='firefox', max_open_pages_per_browser=1)

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[plugin_chromium, plugin_firefox]),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        browser_name = (
            context.page.context.browser.browser_type.name
            if context.page.context.browser
            else 'undefined'
        )
        context.log.info(f'Processing {context.request.url} with {browser_name} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev', 'https://apify.com/'])


if __name__ == '__main__':
    asyncio.run(main())

Browser launch and context configuration

The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=False,
        browser_type='chromium',
        # Browser launch options.
        browser_launch_options={
            # To use the `msedge` channel, install it first: `playwright install msedge`.
            'channel': 'msedge',
            'slow_mo': 200,
        },
        # Context options, applied to each new page as it is created.
        browser_new_context_options={
            'color_scheme': 'dark',
            # Set custom HTTP headers.
            'extra_http_headers': {
                'Custom-Header': 'my-header',
                'Accept-Language': 'en',
            },
            # Set only the User-Agent.
            'user_agent': 'My-User-Agent',
        },
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

You can also configure each plugin used by BrowserPool:

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    browser_pool=BrowserPool(
        plugins=[
            PlaywrightBrowserPlugin(
                browser_type='chromium',
                browser_launch_options={
                    'headless': False,
                    'channel': 'msedge',
                    'slow_mo': 200,
                },
                browser_new_context_options={
                    'color_scheme': 'dark',
                    'extra_http_headers': {
                        'Custom-Header': 'my-header',
                        'Accept-Language': 'en',
                    },
                    'user_agent': 'My-User-Agent',
                },
            )
        ]
    )
)

For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser designed to reduce detection by anti-scraping measures; the example wraps it in a custom plugin that is fully compatible with PlaywrightCrawler.
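
The general shape of such a plugin, condensed and simplified from that example, is a subclass of PlaywrightBrowserPlugin that overrides browser creation and is handed to BrowserPool. This is only a sketch, not the full Camoufox example: the plain Playwright Firefox launch below is a runnable placeholder for a stealth browser's own launcher, and the override leans on the plugin's internal _playwright and _browser_launch_options attributes as used in the upstream example.

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler


class CustomBrowserPlugin(PlaywrightBrowserPlugin):
    """Sketch of a plugin that swaps in a custom browser launcher."""

    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        # A stealth library would launch its own browser here; plain Playwright
        # Firefox serves as a runnable placeholder.
        browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        return PlaywrightBrowserController(browser=browser)


crawler = PlaywrightCrawler(
    browser_pool=BrowserPool(plugins=[CustomBrowserPlugin()]),
)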

Page configuration with pre-navigation hooks

In some use cases, you may need to configure the page before it navigates to the target URL, for instance to set navigation timeouts or manipulate other page-level settings. For such cases, you can use the pre_navigation_hook method of the PlaywrightCrawler. Hooks registered with it are called before the page navigates to the target URL and give you access to the page instance for configuration.

import asyncio

from crawlee.crawlers import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        await context.enqueue_links()

    @crawler.pre_navigation_hook
    async def log_navigation_url(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

        # Set a timeout for all navigation methods.
        context.page.set_default_navigation_timeout(600_000)

        # Set the viewport size before navigating to the target URL.
        await context.page.set_viewport_size({'width': 1280, 'height': 1024})

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
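
Pre-navigation hooks are also a convenient place for Playwright's request interception, for example to block heavy resources before navigation. A minimal sketch using the standard page.route API (the glob pattern is just an example) could be registered alongside the hook above:

    @crawler.pre_navigation_hook
    async def block_images(context: PlaywrightPreNavCrawlingContext) -> None:
        # Abort requests for common image formats to speed up crawling.
        await context.page.route('**/*.{png,jpg,jpeg}', lambda route: route.abort())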

Conclusion

This guide introduced the PlaywrightCrawler and explained how to configure it using BrowserPool and PlaywrightBrowserPlugin. You learned how to launch multiple browsers, configure browser and context settings, and apply pre-navigation hooks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!