Version: 3.15

Saving data

A data extraction job would not be complete without saving the data for later use and processing. You've come to the final and most difficult part of this tutorial, so make sure to pay very careful attention!

First, add a new import to the top of the file:

import { PlaywrightCrawler, Dataset } from 'crawlee';

Then, replace the console.log(results) call with:

await Dataset.pushData(results);

and that's it. Unlike earlier, we are being serious now. That's it, you're done. The final code looks like this:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = await page.locator('.product-meta h1').textContent();
            const sku = await page.locator('span.product-meta__sku-number').textContent();

            const priceElement = page
                .locator('span.price')
                .filter({
                    hasText: '$',
                })
                .first();

            const currentPriceString = await priceElement.textContent();
            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElement = page
                .locator('span.product-form__inventory')
                .filter({
                    hasText: 'In stock',
                })
                .first();

            const inStock = (await inStockElement.count()) > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await Dataset.pushData(results);
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

What's `Dataset.pushData()`

Dataset.pushData() is a function that saves data to the default Dataset. Dataset is a storage designed to hold data in a format similar to a table. Each time you call Dataset.pushData() a new row in the table is created, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your disk, but other storage systems can be plugged into Crawlee as well.
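
To make the table analogy concrete, here is a minimal sketch in plain JavaScript that does not use Crawlee at all. The `toTable` helper and the sample items are hypothetical. They only model how pushed objects map to rows and column titles, and say nothing about how Crawlee stores data internally.

```javascript
// Hypothetical helper, not part of Crawlee: it only models the table analogy.
// Each item stands for one object passed to Dataset.pushData().
function toTable(items) {
    // Column titles are the union of property names, in first-seen order.
    const columns = [];
    for (const item of items) {
        for (const key of Object.keys(item)) {
            if (!columns.includes(key)) columns.push(key);
        }
    }
    // Each item becomes one row; properties an item lacks are left as null.
    const rows = items.map((item) => columns.map((key) => (key in item ? item[key] : null)));
    return { columns, rows };
}

// Made-up sample items, shaped loosely like the `results` object in the crawler.
const table = toTable([
    { url: 'https://example.com/products/a', title: 'Product A', currentPrice: 10 },
    { url: 'https://example.com/products/b', title: 'Product B', availableInStock: true },
]);

console.log(table.columns); // the column titles
console.log(table.rows); // one array of cell values per pushed item
```

Because the columns come from property names, keeping those names consistent across all pushed objects gives you a cleaner table.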

Automatic dataset initialization in Crawlee

Each time you start Crawlee a default Dataset is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the Result storage guide and the Dataset.open() function.

Finding saved data

Unless you changed the configuration that Crawlee uses locally (which would suggest that you know what you're doing and don't need this tutorial anyway), you'll find your data in the storage directory that Crawlee creates in the working directory of the running script:

{PROJECT_FOLDER}/storage/datasets/default/

This folder holds all your saved data in numbered files, in the order they were pushed into the dataset. Each file represents one table row, which in this crawler corresponds to one invocation of Dataset.pushData().
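
If you want to inspect the saved files from a separate script, the following sketch reads them back using only Node.js built-ins. The `readDatasetItems` helper is a hypothetical name, not a Crawlee API. It assumes the default local layout described above, with one JSON item per numbered file and file names that sort in push order.

```javascript
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Hypothetical helper, not part of Crawlee.
// Reads every numbered .json file in a dataset directory and returns the parsed items.
async function readDatasetItems(datasetDir) {
    const fileNames = (await readdir(datasetDir))
        .filter((name) => /^\d+\.json$/.test(name)) // skip anything that is not a numbered item file
        .sort(); // assumes equal-length, zero-padded names, so string order matches push order

    const items = [];
    for (const name of fileNames) {
        const text = await readFile(path.join(datasetDir, name), 'utf8');
        items.push(JSON.parse(text));
    }
    return items;
}

// Example usage, run from the project folder after a crawl has finished:
// const items = await readDatasetItems('./storage/datasets/default');
// console.log(`Loaded ${items.length} items`);
```

This is only meant for quick inspection. For anything more involved, the Result storage guide covers the storage classes that Crawlee provides.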

Single file data storage options

If you would like to store your data in a single big file, instead of many small ones, see the Result storage guide for Key-value stores.

Next steps

Next, you'll see some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run.
