Basic crawler

This is the most bare-bones example of using Crawlee. It demonstrates some of the library's building blocks, such as the BasicCrawler. You probably don't need to go this deep, though: in most cases it is better to start with one of the full-featured crawlers, such as CheerioCrawler or PlaywrightCrawler.

The script downloads several web pages with plain HTTP requests using the sendRequest utility function (which uses the got-scraping npm module internally) and stores each page's raw HTML and URL in the default dataset. In a local configuration, the data is stored as JSON files in ./storage/datasets/default.

import { BasicCrawler } from 'crawlee';

// Create a BasicCrawler - the simplest crawler that enables
// users to implement the crawling logic themselves.
const crawler = new BasicCrawler({
    // This function will be called for each URL to crawl.
    async requestHandler({ pushData, request, sendRequest, log }) {
        const { url } = request;
        log.info(`Processing ${url}...`);

        // Fetch the page HTML via Crawlee's sendRequest utility method.
        // By default, the method uses the request that is currently being handled, so you don't have to
        // provide it yourself. You can also provide a custom request if you want.
        const { body } = await sendRequest();

        // Store the HTML and URL in the default dataset.
        await pushData({
            url,
            html: body,
        });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://www.google.com',
    'https://www.example.com',
    'https://www.bing.com',
    'https://www.wikipedia.com',
]);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
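
Once the crawler has finished, the stored records can be inspected straight from disk. The following is a minimal sketch, not part of Crawlee itself: readDatasetItems is a hypothetical helper that uses only Node.js built-ins, and it assumes that each JSON file in ./storage/datasets/default holds one record with the url and html fields pushed above. Files that don't match that shape are skipped when printing.

```javascript
import fs from 'node:fs';
import path from 'node:path';

// Hypothetical helper (not a Crawlee API): reads every *.json file in a
// dataset directory and returns the parsed records, sorted by file name.
function readDatasetItems(datasetDir) {
    return fs
        .readdirSync(datasetDir)
        .filter((name) => name.endsWith('.json'))
        .sort()
        .map((name) => JSON.parse(fs.readFileSync(path.join(datasetDir, name), 'utf8')));
}

// Location of the default dataset in a local configuration, as described above.
const datasetDir = path.join('storage', 'datasets', 'default');

// Only print a summary if the crawler has already been run in this directory.
if (fs.existsSync(datasetDir)) {
    for (const item of readDatasetItems(datasetDir)) {
        // Assumption: records have the { url, html } shape from the example;
        // anything else (for instance a metadata file) is skipped.
        if (typeof item.url !== 'string' || typeof item.html !== 'string') continue;
        console.log(`${item.url}: ${item.html.length} characters of HTML`);
    }
}
```

Reading the files directly is handy for a quick check. Inside a Crawlee script, the library's own dataset API is likely the better choice, since it does not depend on the on-disk layout.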