Version: 3.17

Crawl a sitemap

A sitemap tells search engines which pages and files on a website are important, and it also provides valuable information about those files. This example builds a sitemap crawler that downloads a sitemap and crawls the URLs it lists, using the Sitemap utility class provided by the @crawlee/utils module (the examples below import it from the crawlee package).
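To make the idea of a sitemap concrete, here is a minimal sketch of what such a file typically looks like and how the page URLs can be pulled out of it. The XML content, the example.com URLs, and the `extractLocs` helper are all made up for illustration; this is not how the Sitemap class works internally, and a regular expression is only adequate for a simplified document like this one.

```javascript
// A tiny sitemap document, inlined for illustration (hypothetical URLs).
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/intro</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>`;

// Hypothetical helper: collect the text content of every <loc> element.
// A real parser would also handle XML entities, CDATA, and nested sitemap indexes.
function extractLocs(xml) {
    return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((match) => match[1]);
}

const sampleUrls = extractLocs(sitemapXml);
console.log(sampleUrls);
```

Each `<url>` entry carries the page address in `<loc>`, and optional elements such as `<lastmod>`, `<changefreq>`, and `<priority>` hold the extra information about the page.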

Cheerio Crawler
Puppeteer Crawler
Playwright Crawler

import { CheerioCrawler, Sitemap } from 'crawlee';

const crawler = new CheerioCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();

tip

To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome image for your Dockerfile.

import { PuppeteerCrawler, Sitemap } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();

tip

To run this example on the Apify Platform, select the apify/actor-node-playwright-chrome image for your Dockerfile.

import { PlaywrightCrawler, Sitemap } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this option to crawl the whole sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();
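
A sitemap often lists more pages than a given crawl needs. As a rough sketch that applies to any of the three crawlers, the URL list could be narrowed down before it is added to the crawler. The `filterByPathPrefix` helper and the sample URLs below are hypothetical and not part of Crawlee; the helper uses only the standard `URL` class.

```javascript
// Hypothetical helper: keep only URLs whose path starts with the given prefix.
function filterByPathPrefix(urls, prefix) {
    return urls.filter((url) => {
        try {
            return new URL(url).pathname.startsWith(prefix);
        } catch {
            // Skip entries that are not valid absolute URLs.
            return false;
        }
    });
}

// Made-up sample standing in for the `urls` array returned by the sitemap loader.
const sitemapUrls = [
    'https://example.com/docs/quick-start',
    'https://example.com/blog',
    'https://example.com/docs/examples/crawl-sitemap',
    'not a url',
];

const docsUrls = filterByPathPrefix(sitemapUrls, '/docs/');
console.log(docsUrls);
```

In the examples above, the filtered array would be passed to `crawler.addRequests()` in place of the full `urls` list.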