
Crawlee for Python v0.6

Vlada Dusek
Developer of Crawlee for Python

Crawlee for Python v0.6 is here, and it's packed with new features and important bug fixes. If you're upgrading from a previous version, please take a moment to review the breaking changes detailed below to ensure a smooth transition.

Getting started

You can upgrade to the latest version straight from PyPI:

pip install --upgrade crawlee

Check out the full changelog on our website to see all the details. If you are updating from an older version, make sure to follow our Upgrading to v0.6 guide.

Adaptive Playwright crawler

The new AdaptivePlaywrightCrawler is a hybrid solution that combines the best of both worlds: full browser rendering with Playwright and lightweight HTTP-based crawling (using, for example, BeautifulSoupCrawler or ParselCrawler). It automatically switches between the two methods based on real-time analysis of the target page, helping you achieve lower crawl costs and improved performance when crawling a variety of websites.

The example below demonstrates how the AdaptivePlaywrightCrawler can handle both static and dynamic content.

import asyncio
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
        playwright_crawler_specific_kwargs={'browser_type': 'chromium'},
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Do some processing using `parsed_content`.
        context.log.info(context.parsed_content.title)

        # Locate the h2 element, waiting up to 5 seconds for it to appear.
        h2 = await context.query_selector_one('h2', timedelta(seconds=5))
        # Do stuff with the element found by the selector.
        context.log.info(h2)

        # Find more links and enqueue them.
        await context.enqueue_links()
        # Save some data.
        await context.push_data({'Visited url': context.request.url})

    await crawler.run(['https://www.crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
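
The example above uses the BeautifulSoup static parser. If you would rather parse with Parsel, there is an analogous factory method; here is a minimal sketch, assuming the same keyword arguments apply:

crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    max_requests_per_crawl=5,
    playwright_crawler_specific_kwargs={'browser_type': 'chromium'},
)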

Check out our Adaptive Playwright crawler guide for more details on how to use this new crawler.

Browserforge fingerprints

To help you avoid detection and blocking, Crawlee now integrates the browserforge library, an intelligent browser header and fingerprint generator. This feature simulates real browser behavior by automatically randomizing HTTP headers and fingerprints, making your crawling sessions significantly more resilient against anti-bot measures.

With browserforge fingerprints enabled by default, your crawler sends realistic HTTP headers and user-agent strings. HTTP-based crawlers, which use HttpxHttpClient by default, benefit from these adjustments, while the CurlImpersonateHttpClient employs its own stealth techniques. The PlaywrightCrawler adjusts both HTTP headers and browser fingerprints accordingly. Together, these improvements make your crawlers much harder to detect.
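
HTTP-based crawlers need no extra configuration to take advantage of this. As a minimal sketch (assuming default settings throughout), the following BeautifulSoupCrawler fetches httpbin.org/headers, which echoes the request headers back in the response body, so you can inspect what was actually sent:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # The default HttpxHttpClient applies the browserforge-generated headers.
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # The endpoint returns the request headers as the response body,
        # so logging the page text reveals the generated headers.
        context.log.info(context.soup.get_text())

    await crawler.run(['https://www.httpbin.org/headers'])


if __name__ == '__main__':
    asyncio.run(main())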

Below is an example of using PlaywrightCrawler, which now benefits from the browserforge library:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # The browserforge fingerprints and headers are used by default.
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url
        context.log.info(f'Crawling URL: {url}')

        # Decode and log the response body, which contains the headers we sent.
        headers = (await context.response.body()).decode()
        context.log.info(f'Response headers: {headers}')

        # Extract and log the User-Agent and UA data used in the browser context.
        ua = await context.page.evaluate('() => window.navigator.userAgent')
        ua_data = await context.page.evaluate('() => window.navigator.userAgentData')
        context.log.info(f'Navigator user-agent: {ua}')
        context.log.info(f'Navigator user-agent data: {ua_data}')

    # The endpoint httpbin.org/headers returns the request headers in the response body.
    await crawler.run(['https://www.httpbin.org/headers'])


if __name__ == '__main__':
    asyncio.run(main())

For further details on utilizing browserforge to avoid blocking, please refer to our Avoid getting blocked guide.

CLI dependencies

In v0.6, we've reduced the size of the core package by moving CLI (template creation) dependencies to optional extras. This change reduces the package footprint, keeping the base installation lightweight. To use Crawlee's CLI for creating new projects, simply install the package with the CLI extras.
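
If you prefer a permanent installation, a minimal sketch with pip (the crawlee command then becomes available in your environment):

pip install 'crawlee[cli]'
crawlee create my-crawler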

Alternatively, to create a new project from a template without installing anything permanently, you can use pipx:

pipx run 'crawlee[cli]' create my-crawler

Or with uvx:

uvx 'crawlee[cli]' create my-crawler

This change ensures that while the core package remains lean, you can still opt in to CLI functionality when bootstrapping new projects.

Conclusion

We are excited to share that Crawlee v0.6 is here. If you have any questions or feedback, please open a GitHub discussion. If you encounter any bugs or have an idea for a new feature, please open a GitHub issue.