Dataset Map and Reduce methods
This example shows a simple use case of the Dataset map and reduce methods. Both methods can be used to simplify the process of working with dataset results, and both can be called on the dataset directly.
It is important to mention that both methods return a new result (map returns a new array and reduce can return any type) - neither method updates the dataset in any way.
Examples for both methods are demonstrated on a simple dataset containing the results scraped from a page: the URL and a hypothetical number of h1 - h3 header elements under the headingCount key.
This data structure is stored in the default dataset under {PROJECT_FOLDER}/storage/datasets/default/. If you want to simulate the functionality, you can use the dataset.pushData() method to save the example JSON array to your dataset:
[
    {
        "url": "https://crawlee.dev/",
        "headingCount": 11
    },
    {
        "url": "https://crawlee.dev/storage",
        "headingCount": 8
    },
    {
        "url": "https://crawlee.dev/proxy",
        "headingCount": 4
    }
]
Map
The dataset map method is very similar to standard Array mapping methods. It produces a new array of values by passing each value in the existing array through a transformation function, and it also accepts an options parameter.
Here, the map method is used to check whether there are more than 5 header elements on each page:
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);

// Calling map function and filtering through mapped items...
const moreThan5headers = (await dataset.map((item) => item.headingCount)).filter((count) => count > 5);

// Saving the result of map to default key-value store...
await KeyValueStore.setValue('pages_with_more_than_5_headers', moreThan5headers);
The moreThan5headers variable is an array of headingCount values where the number of headers is greater than 5.
The map method's result value saved to the key-value store should be:
[11, 8]
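For comparison, the same result could be computed without map by downloading all items first and then using plain Array methods. The following is only a minimal sketch, assuming the same example data as above and that dataset.getData() returns an object with an items array:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Download all dataset items at once...
const { items } = await dataset.getData();

// ...and transform and filter them with plain Array methods instead of dataset.map()
const moreThan5headers = items.map((item) => item.headingCount).filter((count) => count > 5);

The advantage of dataset.map() is that it iterates over the dataset items for you, so you do not have to fetch and handle the whole result set manually.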
Reduce
The dataset reduce method does not produce a new array of values - it reduces a list of values down to a single value. The method iterates through the items in the dataset using the memo argument. After performing the necessary calculation, the memo is passed to the next iteration, while the item just processed is removed from further iteration.
Here, the reduce method is used to get the total number of headers scraped (across all items in the dataset):
import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Seeding the dataset with some data
await dataset.pushData([
    {
        url: 'https://crawlee.dev/',
        headingCount: 11,
    },
    {
        url: 'https://crawlee.dev/storage',
        headingCount: 8,
    },
    {
        url: 'https://crawlee.dev/proxy',
        headingCount: 4,
    },
]);

// Calling reduce function and using memo to calculate the number of headers
const pagesHeadingCount = await dataset.reduce((memo, value) => {
    return memo + value.headingCount;
}, 0);

// Saving the result of reduce to default key-value store
await KeyValueStore.setValue('pages_heading_count', pagesHeadingCount);
The reduce method call produces a single value, pagesHeadingCount, which contains the total header count for all scraped pages (all dataset items). Note that the dataset itself is not modified.
The reduce method's result value saved to the key-value store should be:
23
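Because reduce can return any type, the memo does not have to be a number. The following is a minimal sketch (using the same example data and a hypothetical most_headers key) that uses reduce to find the dataset item with the highest headingCount:

import { Dataset, KeyValueStore } from 'crawlee';

const dataset = await Dataset.open<{
    url: string;
    headingCount: number;
}>();

// Using an item-shaped memo to track the page with the most headers
const pageWithMostHeaders = await dataset.reduce((memo, value) => {
    // Keep whichever item has the higher headingCount
    return value.headingCount > memo.headingCount ? value : memo;
}, { url: '', headingCount: 0 });

// Saving the result to the default key-value store under a hypothetical key
await KeyValueStore.setValue('most_headers', pageWithMostHeaders);

With the example data above, the stored value should be the item for https://crawlee.dev/ with a headingCount of 11.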