Quick Start

This short tutorial will get you scraping with Crawlee in a minute or two. To learn in depth how Crawlee works, read the Introduction, a comprehensive step-by-step guide to creating your first scraper.

Choose your crawler

Crawlee comes with three main crawler classes: CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler. All classes share the same interface for maximum flexibility when switching between them.

CheerioCrawler

This is a plain HTTP crawler. It parses HTML using the Cheerio library and crawls the web using the specialized got-scraping HTTP client, which masquerades as a browser. It's very fast and efficient, but it can't render JavaScript, so it won't work on pages that build their content client-side.

PuppeteerCrawler

This crawler crawls using a headless browser controlled by the Puppeteer library. It can drive Chromium or Chrome. Puppeteer is the de facto standard in headless browser automation.

PlaywrightCrawler

Playwright is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox and WebKit. If you're not already familiar with Puppeteer and you need a headless browser, go with Playwright.
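
Because all three classes share the same interface, switching crawlers mostly means swapping the class name and how you read the page. As a minimal sketch mirroring the examples below (not from the original docs), here is a PuppeteerCrawler equivalent:

import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        // page is a Puppeteer Page, so the title comes from the live DOM.
        const title = await page.title();
        log.info(`Title of ${request.url}: ${title}`);
        await enqueueLinks({ strategy: 'same-domain' });
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://crawlee.dev/']);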

Installation

Crawlee requires Node.js 16 or later. It can be added to any Node.js project by running:

npm install crawlee
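
Note that the browser automation libraries aren't bundled with Crawlee. If you plan to use PuppeteerCrawler or PlaywrightCrawler, install the corresponding package alongside it, for example:

npm install crawlee playwright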

Crawling

Run the following example to perform a recursive crawl of the Crawlee website using CheerioCrawler:

src/main.mjs
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const { url } = request;

        // Extract HTML title of the page.
        const title = $('title').text();
        log.info(`Title of ${url}: ${title}`);

        // Add links from the page that point
        // to the same domain as the original request.
        await enqueueLinks({ strategy: 'same-domain' });

        // Save extracted data to storage.
        await Dataset.pushData({ url, title });
    },
});

// Add a start URL to the queue and run the crawler.
await crawler.run(['https://crawlee.dev/']);

When you run the example, you will see Crawlee automating the data extraction process in your terminal.

INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
INFO CheerioCrawler: Title of https://crawlee.dev/: Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee
INFO CheerioCrawler: Title of https://crawlee.dev/docs/examples: Examples | Crawlee
INFO CheerioCrawler: Title of https://crawlee.dev/docs/quick-start: Quick Start | Crawlee
INFO CheerioCrawler: Title of https://crawlee.dev/api/core: @crawlee/core | API | Crawlee
INFO CheerioCrawler: Title of https://crawlee.dev/api/core/changelog: Changelog | API | Crawlee

Running headful browsers

Browsers controlled by Puppeteer and Playwright run headless by default (without a visible window). You can switch to headful mode by passing the headless: false option to the crawler's constructor. This is useful during development, when you want to see what's going on in the browser.

src/main.mjs
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // When you turn off headless mode, the crawler
    // will run with a visible browser window.
    headless: false,
    async requestHandler({ request, page, enqueueLinks, log }) {
        const { url } = request;
        const title = await page.title();
        log.info(`Title of ${url}: ${title}`);
        await enqueueLinks({ strategy: 'same-domain' });
        await Dataset.pushData({ url, title });
    },
});

await crawler.run(['https://crawlee.dev/']);

When you run the example code, you'll see an automated browser blaze through the Crawlee website.
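
If the browser moves too fast to follow, you can slow it down. A minimal sketch, assuming Playwright's slowMo launch option (a delay in milliseconds) passed through Crawlee's launchContext:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    headless: false,
    launchContext: {
        // slowMo pauses after each Playwright operation,
        // which makes the automated run easier to watch.
        launchOptions: { slowMo: 250 },
    },
    async requestHandler({ request, page, log }) {
        log.info(`Visited ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev/']);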


Results

Crawlee stores data in the ./storage directory under your current working directory. The results of your crawl will be available as JSON files under ./storage/datasets/default/*.json.

./storage/datasets/default/000000001.json
{
    "url": "https://crawlee.dev/",
    "title": "Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee"
}
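
Besides reading the files directly, you can load stored items back in code through the Dataset API; a short sketch using Dataset.open() and getData():

import { Dataset } from 'crawlee';

// Open the default dataset and fetch everything it holds.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Crawled ${items.length} pages`);
console.log(items[0]?.title);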
Tip: You can override the storage directory by setting the CRAWLEE_STORAGE_DIR environment variable.
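
For example, to write the results of a single run to a hypothetical ./my-storage directory (POSIX shell syntax):

CRAWLEE_STORAGE_DIR=./my-storage node src/main.mjs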

Examples and further reading

You can find more examples showcasing various features of Crawlee in the Examples section of the documentation. To better understand Crawlee and its components, read the step-by-step Introduction guide.

Local usage with Crawlee command-line interface (CLI)

To scaffold a new project from a boilerplate template, you can use the Crawlee CLI tool by running:

npx crawlee create my-cheerio-crawler

The CLI will prompt you to select a project boilerplate template; let's pick the Crawlee cheerio template [TypeScript]. The tool will create a directory called my-cheerio-crawler containing the Node.js project files. You can then run the project as follows:

cd my-cheerio-crawler
npm start