Version: 3.15

Crawl a sitemap

Copy for LLM

We will crawl sitemap which tells search engines which pages and file are important in the website, it also provides valuable information about these files. This example builds a sitemap crawler which downloads and crawls the URLs from a sitemap, by using the Sitemap utility class provided by the @crawlee/utils module.

Cheerio Crawler
Puppeteer Crawler
Playwright Crawler

Run on

import { CheerioCrawler, Sitemap } from 'crawlee';

const crawler = new CheerioCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();

tip

To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome image for your Dockerfile.

Run on

import { PuppeteerCrawler, Sitemap } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();

tip

To run this example on the Apify Platform, select the apify/actor-node-playwright-chrome image for your Dockerfile.

Run on

import { PlaywrightCrawler, Sitemap } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Function called for each URL
    async requestHandler({ request, log }) {
        log.info(request.url);
    },
    maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap)
});

const { urls } = await Sitemap.load('https://crawlee.dev/sitemap.xml');

await crawler.addRequests(urls);

// Run the crawler
await crawler.run();