Playwright crawler
A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.
When to use Playwright crawler
Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:
- Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
- Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
- Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.
If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges. See the basic example below for a typical usage demonstration.
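The following is a minimal sketch of such a crawler, assembled from the APIs covered later in this guide; the start URL and the extracted fields are illustrative, and push_data stores the results in the crawler's default dataset:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the browser-rendered page and store it in the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title(),
        })

        # Enqueue links found on the page for further crawling.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())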
Advanced configuration
The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor.
- The PlaywrightBrowserPlugin manages how browsers are launched and how browser contexts are created. It accepts browser launch and new context options.
- The BrowserPool manages the lifecycle of browser instances (launching, recycling, etc.). You can customize its behavior to suit your needs.
Managing multiple browsers
The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation to help avoid blocks or detect different site behaviors.
import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a plugin for each required browser.
    plugin_chromium = PlaywrightBrowserPlugin(browser_type='chromium', max_open_pages_per_browser=1)
    plugin_firefox = PlaywrightBrowserPlugin(browser_type='firefox', max_open_pages_per_browser=1)

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[plugin_chromium, plugin_firefox]),
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        browser_name = context.page.context.browser.browser_type.name if context.page.context.browser else 'undefined'
        context.log.info(f'Processing {context.request.url} with {browser_name} ...')
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev', 'https://apify.com/'])


if __name__ == '__main__':
    asyncio.run(main())
Browser launch and context configuration
The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=False,
        browser_type='chromium',
        # Browser launch options
        browser_launch_options={
            # To use the `msedge` channel, you must first install it with `playwright install msedge`.
            'channel': 'msedge',
            'slow_mo': 200,
        },
        # Context launch options, applied to each page as it is created
        browser_new_context_options={
            'color_scheme': 'dark',
            # Set custom headers
            'extra_http_headers': {
                'Custom-Header': 'my-header',
                'Accept-Language': 'en',
            },
            # Set only the User-Agent
            'user_agent': 'My-User-Agent',
        },
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
You can also configure each plugin used by BrowserPool:
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    browser_pool=BrowserPool(
        plugins=[
            PlaywrightBrowserPlugin(
                browser_type='chromium',
                browser_launch_options={
                    'headless': False,
                    'channel': 'msedge',
                    'slow_mo': 200,
                },
                browser_new_context_options={
                    'color_scheme': 'dark',
                    'extra_http_headers': {
                        'Custom-Header': 'my-header',
                        'Accept-Language': 'en',
                    },
                    'user_agent': 'My-User-Agent',
                },
            )
        ]
    )
)
For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with PlaywrightCrawler.
Page configuration with pre-navigation hooks
In some use cases, you may need to configure the page before it navigates to the target URL, for instance, to set navigation timeouts or manipulate other page-level settings. For such cases, you can use the pre_navigation_hook method of the PlaywrightCrawler. This method is called before the page navigates to the target URL and allows you to configure the page instance.
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext, PlaywrightPreNavCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    @crawler.pre_navigation_hook
    async def log_navigation_url(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url} ...')

        # Set a timeout (in milliseconds) for all navigation methods.
        context.page.set_default_navigation_timeout(600_000)

        # Set the viewport size before navigating to the target URL.
        await context.page.set_viewport_size({'width': 1280, 'height': 1024})

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
Conclusion
This guide introduced the PlaywrightCrawler and explained how to configure it using BrowserPool and PlaywrightBrowserPlugin. You learned how to launch multiple browsers, configure browser and context settings, and apply pre-navigation hooks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!