Result storage

This guide explains the available result storage options in Crawlee and how to use them to store and retrieve data.

Introduction

Crawlee provides multiple result storage options, each suited to specific use cases. By default, data is saved to a local directory specified by the CRAWLEE_STORAGE_DIR environment variable. If this variable is not set, Crawlee defaults to using ./storage within the current working directory.

Result storages are managed by storage clients, which are subclasses of the BaseStorageClient. For example, the MemoryStorageClient stores data in memory and can also persist it to the local directory. Data is stored in the following directory structure:

{CRAWLEE_STORAGE_DIR}/{result_storage}/{QUEUE_ID}/

note

The local directory is specified by the CRAWLEE_STORAGE_DIR environment variable, which defaults to ./storage. {QUEUE_ID} is the name or ID of the specific storage. Its default value is default, unless you override it by setting the CRAWLEE_DEFAULT_DATASET_ID or CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID environment variable.
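As a rough illustration of the rules above, the following sketch resolves a storage directory from the environment variables using only the standard library. The helper name resolve_storage_path is hypothetical and not part of Crawlee; it only mirrors the defaults described in this section and is not how Crawlee resolves paths internally.

```python
import os
from pathlib import PurePosixPath


def resolve_storage_path(storage_type: str, id_env_var: str) -> PurePosixPath:
    """Hypothetical helper: build {CRAWLEE_STORAGE_DIR}/{storage_type}/{id}."""
    # Fall back to ./storage when CRAWLEE_STORAGE_DIR is not set.
    root = os.environ.get('CRAWLEE_STORAGE_DIR', './storage')
    # Fall back to 'default' when the ID override variable is not set.
    storage_id = os.environ.get(id_env_var, 'default')
    return PurePosixPath(root) / storage_type / storage_id


if __name__ == '__main__':
    print(resolve_storage_path('datasets', 'CRAWLEE_DEFAULT_DATASET_ID'))
    print(resolve_storage_path('key_value_stores', 'CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID'))
```

Setting CRAWLEE_STORAGE_DIR or one of the ID variables before running the script changes the printed paths accordingly.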

Dataset

The Dataset is designed for storing structured data, where each entry has a consistent set of attributes, such as products in an online store or real estate listings. Think of a Dataset as a table: each entry corresponds to a row, with attributes represented as columns. Datasets are append-only, allowing you to add new records but not modify or delete existing ones. Every Crawlee project run is associated with a default dataset, typically used to store results specific to that crawler execution. However, using this dataset is optional.
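To make the table analogy concrete, here is a minimal standard-library sketch that lays out a list of records as CSV, with one row per record and one column per attribute. The function records_to_csv is a hypothetical illustration and does not reflect how Crawlee's own export is implemented; in particular, the column order Crawlee produces may differ.

```python
import csv
import io


def records_to_csv(records: list[dict]) -> str:
    """Hypothetical helper: render records as a CSV table."""
    # Collect column names in first-seen order across all records.
    columns: list[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # Attributes missing from a record are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval='', lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == '__main__':
    rows = [{'foo': 'bar'}, {'foo': 'bar2', 'col2': 'val2'}, {'col3': 123}]
    print(records_to_csv(rows))
```

Records with differing attributes still fit the table; the missing cells are simply left empty.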

By default, data are stored using the following path structure:

{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json

{CRAWLEE_STORAGE_DIR}: The root directory for all storage data specified by the environment variable.
{DATASET_ID}: The dataset's ID, "default" by default.
{INDEX}: Represents the zero-based index of the record within the dataset.
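Because each record is stored as its own JSON file, a persisted dataset can be inspected without Crawlee. The sketch below reads every record file from a dataset directory. The function load_dataset_records is hypothetical, and it rests on two assumptions that are not stated in this guide: that sorting file names alphabetically matches the record order, and that files whose names start with a double underscore hold metadata rather than records.

```python
import json
from pathlib import Path


def load_dataset_records(dataset_dir: Path) -> list[dict]:
    """Hypothetical helper: read all {INDEX}.json records from a directory."""
    records = []
    # Assumption: alphabetical file order matches the order of insertion.
    for path in sorted(dataset_dir.glob('*.json')):
        # Assumption: names starting with '__' are metadata, not records.
        if path.name.startswith('__'):
            continue
        records.append(json.loads(path.read_text(encoding='utf-8')))
    return records


if __name__ == '__main__':
    default_dir = Path('./storage/datasets/default')
    if default_dir.is_dir():
        for record in load_dataset_records(default_dir):
            print(record)
```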

The following code demonstrates basic dataset operations in three variants: basic usage, usage with a crawler, and explicit usage with a crawler.

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset, if it does not exist, it will be created.
    # Leave name empty to use the default dataset.
    dataset = await Dataset.open()

    # Push a single row of data.
    await dataset.push_data({'foo': 'bar'})

    # Push multiple rows of data (anything JSON-serializable can be pushed).
    await dataset.push_data([{'foo': 'bar2', 'col2': 'val2'}, {'col3': 123}])

    # Fetch all data from the dataset.
    data = await dataset.get_data()
    # Do something with it...

    # Remove the dataset.
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the (default) dataset.
        await context.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to a file.
    await crawler.export_data(path='dataset.csv')


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset, if it does not exist, it will be created.
    # Leave name empty to use the default dataset.
    dataset = await Dataset.open()

    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the dataset.
        await dataset.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to the key-value store.
    await dataset.export_to(key='dataset', content_type='csv')


if __name__ == '__main__':
    asyncio.run(main())

Key-value store

The KeyValueStore is designed to save and retrieve data records or files efficiently. Each record is uniquely identified by a key and is associated with a specific MIME type, making the KeyValueStore ideal for tasks like saving web page screenshots, PDFs, or tracking the state of crawlers.

By default, data are stored using the following path structure:

{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}

{CRAWLEE_STORAGE_DIR}: The root directory for all storage data specified by the environment variable.
{STORE_ID}: The KVS's ID, "default" by default.
{KEY}: The unique key for the record.
{EXT}: The file extension corresponding to the MIME type of the content.
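The path template above can be sketched with the standard library's mimetypes module, which maps a MIME type to a file extension. The function kvs_record_path and the '.bin' fallback are hypothetical choices for illustration; Crawlee's own mapping from content type to extension may differ.

```python
import mimetypes
from pathlib import PurePosixPath


def kvs_record_path(root: str, store_id: str, key: str, content_type: str) -> PurePosixPath:
    """Hypothetical helper: build {root}/key_value_stores/{store_id}/{key}.{ext}."""
    # guess_extension returns None for unknown types; '.bin' is an assumed fallback.
    ext = mimetypes.guess_extension(content_type) or '.bin'
    return PurePosixPath(root) / 'key_value_stores' / store_id / f'{key}{ext}'


if __name__ == '__main__':
    print(kvs_record_path('storage', 'default', 'screenshot-home', 'image/png'))
```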

The following code demonstrates the usage of the KeyValueStore in three variants: basic usage, usage with a crawler, and explicit usage with a crawler.

import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open()

    # Set a value associated with 'some-key'.
    await kvs.set_value(key='some-key', value={'foo': 'bar'})

    # Get the value associated with 'some-key'.
    value = await kvs.get_value('some-key')
    # Do something with it...

    # Delete the value associated with 'some-key' by setting it to None.
    await kvs.set_value(key='some-key', value=None)

    # Remove the key-value store.
    await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture the screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Get the key-value store from the context. If it does not exist,
        # it will be created. Leave name empty to use the default KVS.
        kvs = await context.get_key_value_store()

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open()

    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture the screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

To see a real-world example of how to get the input from the key-value store, see the Screenshots example.

Result-related helpers

We offer several helper functions to simplify interactions with result storages:

The push_data function allows you to manually add data to the dataset. You can optionally specify the dataset ID or its name.
The get_key_value_store function retrieves the key-value store for the current crawler run. If the KVS does not exist, it will be created. You can also specify the KVS's ID or its name.

Cleaning up the storages

Default storages are purged before the crawler starts, unless explicitly configured otherwise; to change this behavior, see Configuration.purge_on_start. This cleanup happens as soon as a storage is accessed, either when you open a storage (e.g. using Dataset.open or KeyValueStore.open) or when you interact with a storage through one of the helper functions (e.g. push_data), which implicitly opens the result storage.

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    config = Configuration(purge_on_start=False)
    crawler = HttpCrawler(configuration=config)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

If you do not explicitly interact with storages in your code, the purging will occur automatically when the BasicCrawler.run method is invoked.

If you need to purge storages earlier, you can call MemoryStorageClient.purge_on_start directly. This method triggers the purging process for the underlying storage implementation you are currently using.

import asyncio

from crawlee.storage_clients import MemoryStorageClient

async def main() -> None:
    storage_client = MemoryStorageClient.from_config()
    await storage_client.purge_on_start()

if __name__ == '__main__':
    asyncio.run(main())
