Dataset Map and Reduce methods
This example shows a simple use case of the Dataset map and reduce methods. Both methods simplify working with dataset results and can be called on the dataset directly. Note that both methods return a new result (map returns a new array, reduce can return any type); neither method modifies the dataset in any way.
Examples for both methods are demonstrated on a simple dataset containing the results scraped from a page: the URL and a hypothetical number of h1–h3 header elements under the headingCount key.
This data structure is stored in the default dataset under {PROJECT_FOLDER}/storage/datasets/default/. If you want to simulate the functionality, you can use the dataset.pushData() method to save the example JSON array to your dataset.
[
    {
        "url": "https://crawlee.dev/",
        "headingCount": 11
    },
    {
        "url": "https://crawlee.dev/storage",
        "headingCount": 8
    },
    {
        "url": "https://crawlee.dev/proxy",
        "headingCount": 4
    }
]
Map
The dataset map method is similar to the standard Array.prototype.map() method. It produces a new array of values by passing each value in the dataset through a transformation function, and it also accepts an options parameter.
Here, the map method is used to check whether there are more than 5 header elements on each page:
import { Dataset, KeyValueStore } from 'crawlee';
const dataset = await Dataset.open<{ headingCount: number }>();
// calling map function and filtering through mapped items
const moreThan5headers = (await dataset.map((item) => item.headingCount)).filter((count) => count > 5);
// saving result of map to default Key-value store
await KeyValueStore.setValue('pages_with_more_than_5_headers', moreThan5headers);
The moreThan5headers variable is an array of headingCount values where the number of headers is greater than 5.
The map method's result value saved to the key-value store should be:
[11, 8]
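Since dataset.map mirrors Array.prototype.map, the same map-then-filter pattern can be illustrated with a plain array; the items array below simply stands in for the dataset contents:

```typescript
const items = [
    { url: 'https://crawlee.dev/', headingCount: 11 },
    { url: 'https://crawlee.dev/storage', headingCount: 8 },
    { url: 'https://crawlee.dev/proxy', headingCount: 4 },
];

// Map each item to its headingCount, then keep only counts above 5.
const moreThan5headers = items
    .map((item) => item.headingCount)
    .filter((count) => count > 5);

console.log(moreThan5headers); // [ 11, 8 ]
```

The only practical difference is that dataset.map is asynchronous and reads the items from storage, so its result must be awaited.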
Reduce
The dataset reduce method does not produce a new array of values; it reduces a list of values down to a single value. The method iterates through the dataset items, using the memo argument to carry the intermediate result. After each calculation, the memo is passed on to the next iteration, accumulating the result over the processed items.
Using the reduce method to get the total number of headers scraped across all items in the dataset:
import { Dataset, KeyValueStore } from 'crawlee';
const dataset = await Dataset.open<{ headingCount: number }>();
// calling reduce function and using memo to calculate number of headers
const pagesHeadingCount = await dataset.reduce((memo, value) => {
    return memo + value.headingCount;
}, 0);
// saving result of reduce to default Key-value store
await KeyValueStore.setValue('pages_heading_count', pagesHeadingCount);
The reduce call produces a single value, pagesHeadingCount, which contains the total header count across all scraped pages (all dataset items); the dataset itself is left unchanged.
The reduce method's result value saved to the key-value store should be:
23
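For comparison, the same total can be computed with the plain Array.prototype.reduce method, which dataset.reduce mirrors; again, the items array stands in for the dataset contents:

```typescript
const items = [
    { url: 'https://crawlee.dev/', headingCount: 11 },
    { url: 'https://crawlee.dev/storage', headingCount: 8 },
    { url: 'https://crawlee.dev/proxy', headingCount: 4 },
];

// Accumulate the headingCount values into a single total, starting from 0.
const pagesHeadingCount = items.reduce((memo, value) => memo + value.headingCount, 0);

console.log(pagesHeadingCount); // 23
```

The second argument (0 here) is the initial memo value; on the first iteration memo is 0, and on each subsequent iteration it holds the running total.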