Crawlee is a web scraping and browser automation library
It helps you build reliable crawlers. Fast.
```shell
pipx run crawlee create my-crawler
```
Reliable crawling 🏗️
Crawlee won't fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster.
When a website adds JavaScript rendering, you don't have to rewrite everything, only switch to one of the browser crawlers. When you later find a great API to speed up your crawls, flip the switch back.
Crawlee is built by people who scrape for a living and use it every day to scrape millions of pages. Meet our community on Discord.
Python with type hints
Crawlee for Python is written in a modern way using type hints, providing code completion in your IDE and helping you catch bugs early, before you ever run your code.
Headless browsers
Switch your crawlers from HTTP to headless browsers in 3 lines of code. Crawlee builds on top of Playwright and adds its own anti-blocking features and human-like fingerprints. Chrome, Firefox and more.
Automatic scaling and proxy management
Crawlee automatically manages concurrency based on available system resources and smartly rotates proxies. Proxies that often time out, return network errors, or respond with bad HTTP codes like 401 or 403 are discarded.
Try Crawlee out 👾
The fastest way to try Crawlee out is to use the Crawlee CLI and choose one of the provided templates. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.
```shell
pipx run crawlee create my-crawler
```
If you prefer adding Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It's not bundled with Crawlee, to reduce install size.
```shell
pip install 'crawlee[playwright]'
```
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance and provide optional arguments.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,  # Scrape only the first 50 pages.
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'request_url': context.request.url,
            'page_url': context.page.url,
            'page_title': await context.page.title(),
            'page_content': (await context.page.content())[:10000],
        }
        await context.push_data(data)

    # Add the first URL to the queue and start the crawl.
    await crawler.run(['https://crawlee.dev'])

    # Export the whole dataset to a single file in `./result.csv`.
    await crawler.export_data('./result.csv')

    # Or work with the data directly.
    data = await crawler.get_data()
    print(data.items)


if __name__ == '__main__':
    asyncio.run(main())
```