Basic crawler
This is the most bare-bones example of using Crawlee, which demonstrates some of its building blocks such as the BasicCrawler. You probably don't need to go this deep, though, and it would be better to start with one of the full-featured crawlers like CheerioCrawler or PlaywrightCrawler.
The script simply downloads several web pages with plain HTTP requests using the sendRequest utility function (which uses the got-scraping npm module internally) and stores their raw HTML and URL in the default dataset. In a local configuration, the data will be stored as JSON files in ./storage/datasets/default.
import { BasicCrawler, Dataset } from 'crawlee';

// Create a BasicCrawler - the simplest crawler that enables
// users to implement the crawling logic themselves.
const crawler = new BasicCrawler({
    // This function will be called for each URL to crawl.
    async requestHandler({ request, sendRequest, log }) {
        const { url } = request;
        log.info(`Processing ${url}...`);

        // Fetch the page HTML via the crawlee sendRequest utility method.
        // By default, the method will use the current request that is being handled,
        // so you don't have to provide it yourself. You can also provide a custom request if you want.
        const { body } = await sendRequest();
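        // For instance, sendRequest also accepts got-scraping options that override
        // the defaults - the commented-out line below is only a hypothetical illustration
        // of fetching a different URL instead of the current request:
        // const robots = await sendRequest({ url: 'https://www.example.com/robots.txt' });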

        // Store the HTML and URL to the default dataset.
        await Dataset.pushData({
            url,
            html: body,
        });
    },
});

// The initial list of URLs to crawl. Here we use just a few hard-coded URLs.
await crawler.addRequests([
    'https://www.google.com',
    'https://www.example.com',
    'https://www.bing.com',
    'https://www.wikipedia.com',
]);

// Run the crawler and wait for it to finish.
await crawler.run();

console.log('Crawler finished.');
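Once the crawler has finished, the stored records can be read back through the same Dataset API instead of opening the JSON files in ./storage/datasets/default by hand. The snippet below is a minimal sketch of how you might inspect the collected items, for example from a separate script; the per-item summary it prints is just an illustration and not part of the example above.

import { Dataset } from 'crawlee';

// Open the default dataset the crawler wrote to
// (backed locally by ./storage/datasets/default).
const dataset = await Dataset.open();

// Read the stored items and print a short summary for each one.
const { items } = await dataset.getData();
for (const { url, html } of items) {
    console.log(`${url}: ${html.length} characters of HTML`);
}

Going through the Dataset API rather than the raw JSON files keeps such code independent of where the storage actually lives.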