Version: 3.13

JSDOM crawler

This example demonstrates how to use JSDOMCrawler to interact with a website through the jsdom DOM implementation. The script opens a calculator app from the React examples, clicks 1 + 1 =, and extracts the result.

import { JSDOMCrawler, log } from 'crawlee';

// Create an instance of the JSDOMCrawler class - crawler that automatically
// loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // Setting the `runScripts` option to `true` lets the crawler execute client-side
    // JavaScript on the page. Some websites need this (such as the React application
    // in this example), but running untrusted scripts may pose a security risk.
    runScripts: true,
    // This function is called for each crawled URL.
    // Here we take the `window` object from the crawling context and use it to extract data from the page.
    requestHandler: async ({ window }) => {
        const { document } = window;
        // The `document` object is analogous to the `window.document` object you know from your favourite web browsers.
        // Thanks to this, you can use the regular browser-side APIs here.
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[15].click(); // +
        document.querySelectorAll('button')[12].click(); // 1
        document.querySelectorAll('button')[18].click(); // =

        const result = document.querySelectorAll('.component-display')[0].childNodes[0] as Element;
        // The result is passed to the console. Unlike with Playwright or Puppeteer crawlers,
        // this console call goes to the Node.js console, not the browser console. All the code here runs right in Node.js!
        log.info(result.innerHTML); // 2
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://ahfarmer.github.io/calculator/']);

// Logged at the info level, because debug messages are hidden at the default log level.
log.info('Crawler finished.');
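
Selecting buttons by position, as in `querySelectorAll('button')[12]`, stops working as soon as the page layout changes. A minimal sketch of an alternative is to look the buttons up by their visible label. The `pressButtons` helper and the `Clickable` type are hypothetical names, not part of Crawlee or jsdom, and the labels are assumed to match the text shown on the calculator keys.

```typescript
// Structural type covering the two DOM members the helper needs.
// Real button elements satisfy it, and so do plain objects in tests.
interface Clickable {
    readonly textContent: string | null;
    click(): void;
}

// Hypothetical helper: clicks the buttons whose trimmed text matches each label, in order.
// Throws if a label is missing, so a changed page fails loudly instead of clicking the wrong key.
function pressButtons(buttons: Iterable<Clickable>, labels: string[]): void {
    const byLabel = new Map<string, Clickable>();
    for (const button of buttons) {
        const label = (button.textContent ?? '').trim();
        // Keep the first button found for each label.
        if (!byLabel.has(label)) byLabel.set(label, button);
    }
    for (const label of labels) {
        const button = byLabel.get(label);
        if (button === undefined) {
            throw new Error(`No button labelled "${label}" found`);
        }
        button.click();
    }
}
```

Inside the request handler, this could be called as `pressButtons(document.querySelectorAll('button'), ['1', '+', '1', '='])`, assuming those labels exist on the page.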

In the following example, we use JSDOMCrawler to crawl a list of URLs (here a single start URL passed directly to the crawler), load each URL using a plain HTTP request, parse the HTML using the jsdom DOM implementation, and extract some data from it: the page title and all h1 tags.

import { JSDOMCrawler, log, LogLevel } from 'crawlee';

// Crawlers come with various utilities, e.g. for logging.
// Here we use the debug log level to improve the debugging experience.
// This functionality is optional!
log.setLevel(LogLevel.DEBUG);

// Create an instance of the JSDOMCrawler class - a crawler
// that automatically loads the URLs and parses their HTML using the jsdom library.
const crawler = new JSDOMCrawler({
    // The crawler downloads and processes the web pages in parallel, with a concurrency
    // automatically managed based on the available system memory and CPU (see AutoscaledPool class).
    // Here we define some hard limits for the concurrency.
    minConcurrency: 10,
    maxConcurrency: 50,

    // On error, retry each page at most once.
    maxRequestRetries: 1,

    // Set the timeout for processing each page, in seconds.
    requestHandlerTimeoutSecs: 30,

    // Limit the crawl to 10 requests.
    maxRequestsPerCrawl: 10,

    // This function will be called for each URL to crawl.
    // It accepts a single parameter, the crawling context object, described at:
    // https://crawlee.dev/js/api/jsdom-crawler/interface/JSDOMCrawlerOptions#requestHandler
    // For demonstration, we use only 3 of its properties:
    // - pushData: a helper function that stores results in the default dataset
    // - request: an instance of the Request class with information such as the URL that is being crawled and HTTP method
    // - window: the JSDOM window object
    async requestHandler({ pushData, request, window }) {
        log.debug(`Processing ${request.url}...`);

        // Extract data from the page
        const title = window.document.title;
        const h1texts: { text: string }[] = [];
        window.document.querySelectorAll('h1').forEach((element) => {
            h1texts.push({
                text: element.textContent!,
            });
        });

        // Store the results to the dataset. In local configuration,
        // the data will be stored as JSON files in ./storage/datasets/default
        await pushData({
            url: request.url,
            title,
            h1texts,
        });
    },

    // This function is called once processing of a page has failed maxRequestRetries + 1 times
    // (the first attempt plus all retries).
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed twice.`);
    },
});

// Run the crawler and wait for it to finish.
await crawler.run(['https://crawlee.dev']);

log.debug('Crawler finished.');
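
The `h1texts` loop in the handler relies on a non-null assertion (`textContent!`) and stores heading text exactly as it appears in the markup. As a sketch, the same extraction can be written as a small standalone function that also collapses whitespace and skips empty headings. `collectTexts` and `HasText` are illustrative names, not part of Crawlee, and the whitespace handling is an added assumption about what clean output should look like.

```typescript
// Structural type: anything exposing `textContent`, such as a DOM element.
interface HasText {
    readonly textContent: string | null;
}

// Illustrative helper: returns the text of each element in document order,
// with runs of whitespace collapsed and empty or null texts dropped.
function collectTexts(elements: Iterable<HasText>): { text: string }[] {
    const texts: { text: string }[] = [];
    for (const element of elements) {
        const text = (element.textContent ?? '').replace(/\s+/g, ' ').trim();
        if (text.length > 0) {
            texts.push({ text });
        }
    }
    return texts;
}
```

In the request handler, `const h1texts = collectTexts(window.document.querySelectorAll('h1'));` would then replace the manual loop, with the rest of the `pushData` call unchanged.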