Quick start
This short tutorial will help you start scraping with Crawlee in just a minute or two. For an in-depth understanding of how Crawlee works, check out the Introduction section, which provides a comprehensive step-by-step guide to creating your first scraper.
Choose your crawler
Crawlee offers the following main crawler classes: BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler. All crawlers share the same interface, providing maximum flexibility when switching between them.
BeautifulSoupCrawler
The BeautifulSoupCrawler is a plain HTTP crawler that parses HTML using the well-known BeautifulSoup library. It crawls the web using an HTTP client that mimics a browser. This crawler is very fast and efficient but cannot handle JavaScript rendering.
ParselCrawler
The ParselCrawler is similar to the BeautifulSoupCrawler but uses the Parsel library for HTML parsing. Parsel is a lightweight library that provides a CSS selector-based API for extracting data from HTML documents. If you are familiar with the Scrapy framework, you will feel right at home with Parsel. As with the BeautifulSoupCrawler, the ParselCrawler cannot handle JavaScript rendering.
PlaywrightCrawler
The PlaywrightCrawler uses a headless browser controlled by the Playwright library. It can manage Chromium, Firefox, WebKit, and other browsers. Playwright is the successor to the Puppeteer library and is becoming the de facto standard in headless browser automation. If you need a headless browser, choose Playwright.
Crawlee requires Python 3.9 or later.
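If you are unsure which interpreter a script will run under, you can check the version requirement up front. This is a stdlib-only sketch (the supports_crawlee helper is illustrative, not part of Crawlee):

```python
import sys


def supports_crawlee(version: tuple[int, int]) -> bool:
    """Return True if the given (major, minor) Python version meets Crawlee's requirement."""
    return version >= (3, 9)


# Abort early with a clear message instead of failing later with an obscure error.
if not supports_crawlee(sys.version_info[:2]):
    raise SystemExit('Crawlee requires Python 3.9 or later')
```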
Installation
Crawlee is available as the crawlee PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command:
pip install 'crawlee[all]'
Then, install the Playwright dependencies:
playwright install
Verify that Crawlee is successfully installed:
python -c 'import crawlee; print(crawlee.__version__)'
For detailed installation instructions, see the Setting up documentation page.
Crawling
Run the following example to perform a recursive crawl of the Crawlee website using the selected crawler.
- BeautifulSoupCrawler
- ParselCrawler
- PlaywrightCrawler
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # BeautifulSoupCrawler crawls the web using HTTP requests and parses HTML
    # using the BeautifulSoup library.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    # Define a request handler to process each crawled page and attach it to
    # the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract relevant data from the page context.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Store the extracted data.
        await context.push_data(data)

        # Extract links from the current page and add them to the crawling queue.
        await context.enqueue_links()

    # Add first URL to the queue and start the crawl.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # ParselCrawler crawls the web using HTTP requests and parses HTML using
    # the Parsel library.
    crawler = ParselCrawler(max_requests_per_crawl=50)

    # Define a request handler to process each crawled page and attach it to
    # the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract relevant data from the page context.
        data = {
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        }

        # Store the extracted data.
        await context.push_data(data)

        # Extract links from the current page and add them to the crawling queue.
        await context.enqueue_links()

    # Add first URL to the queue and start the crawl.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # PlaywrightCrawler crawls the web using a headless browser controlled by
    # the Playwright library.
    crawler = PlaywrightCrawler()

    # Define a request handler to process each crawled page and attach it to
    # the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract relevant data from the page context.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Store the extracted data.
        await context.push_data(data)

        # Extract links from the current page and add them to the crawling queue.
        await context.enqueue_links()

    # Add first URL to the queue and start the crawl.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
When you run the example, you will see Crawlee automating the data extraction process in your terminal.
Running headful browser
By default, browsers controlled by Playwright run in headless mode (without a visible window). However, you can configure the crawler to run in headful mode, which is useful during development to observe the browser's actions. You can also switch from the default Chromium browser to Firefox or WebKit.
from crawlee.crawlers import PlaywrightCrawler


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Run with a visible browser window.
        headless=False,
        # Switch to the Firefox browser.
        browser_type='firefox',
    )

    # ...
When you run the example code, you'll see an automated browser navigating through the Crawlee website.
Results
By default, Crawlee stores data in the ./storage directory within your current working directory. The results of your crawl will be saved as JSON files under ./storage/datasets/default/.
To view the results, you can use the cat command:
cat ./storage/datasets/default/000000001.json
The JSON file will contain data similar to the following:
{
    "url": "https://crawlee.dev/",
    "title": "Crawlee · Build reliable crawlers. Fast. | Crawlee"
}
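Beyond cat, you can load the stored items back into Python with the standard library alone, since each item is a plain JSON file. This is a hedged sketch; the load_dataset_items helper is illustrative, not part of Crawlee's API, and it assumes the directory layout shown above:

```python
import json
from pathlib import Path


def load_dataset_items(dataset_dir: str) -> list[dict]:
    """Parse every JSON file in a dataset directory, in filename order."""
    items = []
    for path in sorted(Path(dataset_dir).glob('*.json')):
        items.append(json.loads(path.read_text(encoding='utf-8')))
    return items


# Example usage:
# items = load_dataset_items('./storage/datasets/default')
```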
If you want to change the storage directory, you can set the CRAWLEE_STORAGE_DIR environment variable to your preferred path.
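You can export the variable in your shell, or set it from Python before the crawler runs, since Crawlee reads its configuration from the environment. A minimal sketch (the ./my_storage path is just an example):

```python
import os

# Point Crawlee at a custom storage directory. Set this before the crawler
# runs so the configuration picks it up.
os.environ['CRAWLEE_STORAGE_DIR'] = './my_storage'
```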
Examples and further reading
For more examples showcasing various features of Crawlee, visit the Examples section of the documentation. To get a deeper understanding of Crawlee and its components, read the step-by-step Introduction guide.