Storages

Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the RequestQueue, help you store and deduplicate URLs, while result-oriented storages, like Dataset and KeyValueStore, focus on storing and retrieving scraping results. This guide explains when to use each type, how to interact with them, and how to control their lifecycle.

Overview

Crawlee's storage system consists of two main layers:

Storages (Dataset, KeyValueStore, RequestQueue): High-level interfaces for interacting with different storage types.
Storage clients (MemoryStorageClient, FileSystemStorageClient, etc.): Backend implementations that handle the actual data persistence and management.

For more information about storage clients and their configuration, see the Storage clients guide.

Named and unnamed storages

Crawlee supports two types of storages:

Named storages: Persistent storages with a specific name that persist across runs. These are useful when you want to share data between different crawler runs or access the same storage from multiple places.
Unnamed storages: Temporary storages identified by an alias that are scoped to a single run. These are automatically purged at the start of each run (when purge_on_start is enabled, which is the default).

Default storage

Each storage type (Dataset, KeyValueStore, RequestQueue) has a default instance that can be accessed without specifying id, name or alias. Default unnamed storage is accessed by calling storage's open method without parameters. This is the most common way to use storages in simple crawlers. The special alias "default" is equivalent to calling open without parameters

Run on

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Named storage (persists across runs)
    dataset_named = await Dataset.open(name='my-persistent-dataset')

    # Unnamed storage with alias (purged on start)
    dataset_unnamed = await Dataset.open(alias='temporary-results')

    # Default unnamed storage (both are equivalent and purged on start)
    dataset_default = await Dataset.open()
    dataset_default = await Dataset.open(alias='default')


if __name__ == '__main__':
    asyncio.run(main())

Request queue

The RequestQueue is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a default request queue, which can be used to store URLs during a specific run.

The following code demonstrates the usage of the RequestQueue:

Basic usage
Usage with Crawler
Explicit usage with Crawler

Run on

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue, if it does not exist, it will be created.
    # Leave name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Add a single request.
    await request_queue.add_request('https://apify.com/')

    # Add multiple requests as a batch.
    await request_queue.add_requests(
        ['https://crawlee.dev/', 'https://crawlee.dev/python/']
    )

    # Fetch and process requests from the queue.
    while request := await request_queue.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_queue.mark_request_as_handled(request)

    # Remove the request queue.
    await request_queue.drop()


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler). Request queue is
    # a default request manager, it will be opened, and fully managed if not specified.
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Use context's add_requests method helper to add new requests from the handler.
        await context.add_requests(['https://crawlee.dev/python/'])

    # Use crawler's add_requests method helper to add new requests.
    await crawler.add_requests(['https://apify.com/'])

    # Run the crawler. You can optionally pass the list of initial requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue, if it does not exist, it will be created.
    # Leave name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Interact with the request queue directly, e.g. add a batch of requests.
    await request_queue.add_requests(['https://apify.com/', 'https://crawlee.dev/'])

    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request
    # queue as request manager to it. It will be managed by the crawler.
    crawler = HttpCrawler(request_manager=request_queue)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # And execute the crawler.
    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())

Crawlee provides helper functions to simplify interactions with the RequestQueue:

The add_requests function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs you want to be added to the request storage. If you need to specify further details of the request, such as a label or user_data, you have to pass instances of the Request class to the helper.
The enqueue_links function is designed to discover new URLs in the current page and add them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See Crawl website with relative links example for more details.

Add requests
Enqueue links

Run on

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.add_requests(['https://apify.com/'])

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Request manager

The RequestQueue implements the RequestManager interface, offering a unified API for interacting with various request storage types. This provides a unified way to interact with different request storage types.

If you need custom functionality, you can create your own request storage by subclassing the RequestManager class and implementing its required methods.

For a detailed explanation of the RequestManager and other related components, refer to the Request loaders guide.

Dataset

The Dataset is designed for storing structured data, where each entry has a consistent set of attributes, such as products in an online store or real estate listings. Think of a Dataset as a table: each entry corresponds to a row, with attributes represented as columns. Datasets are append-only, allowing you to add new records but not modify or delete existing ones. Every Crawlee project run is associated with a default dataset, typically used to store results specific to that crawler execution. However, using this dataset is optional.

The following code demonstrates basic operations of the dataset:

Basic usage
Usage with Crawler
Explicit usage with Crawler

Run on

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset, if it does not exist, it will be created.
    # Leave name empty to use the default dataset.
    dataset = await Dataset.open(name='my-dataset')

    # Push a single row of data.
    await dataset.push_data({'foo': 'bar'})

    # Push multiple rows of data (anything JSON-serializable can be pushed).
    await dataset.push_data([{'foo': 'bar2', 'col2': 'val2'}, {'col3': 123}])

    # Fetch all data from the dataset.
    data = await dataset.get_data()
    # Do something with it...

    # Remove the dataset.
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the (default) dataset.
        await context.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to a file.
    await crawler.export_data(path='dataset.csv')


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset, if it does not exist, it will be created.
    # Leave name empty to use the default dataset.
    dataset = await Dataset.open(name='my-dataset')

    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the dataset.
        await dataset.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to the key-value store.
    await dataset.export_to(key='dataset', content_type='csv')


if __name__ == '__main__':
    asyncio.run(main())

Crawlee provides the following helper function to simplify interactions with the Dataset:

The push_data function allows you to manually add data to the dataset. You can optionally specify the dataset ID or its name.

Key-value store

The KeyValueStore is designed to save and retrieve data records or files efficiently. Each record is uniquely identified by a key and is associated with a specific MIME type, making the KeyValueStore ideal for tasks like saving web page screenshots, PDFs, or tracking the state of crawlers.

The following code demonstrates the usage of the KeyValueStore:

Basic usage
Usage with Crawler
Explicit usage with Crawler

Run on

import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open(name='my-key-value-store')

    # Set a value associated with 'some-key'.
    await kvs.set_value(key='some-key', value={'foo': 'bar'})

    # Get the value associated with 'some-key'.
    value = kvs.get_value('some-key')
    # Do something with it...

    # Delete the value associated with 'some-key' by setting it to None.
    await kvs.set_value(key='some-key', value=None)

    # Remove the key-value store.
    await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture the screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Get the key-value store from the context. # If it does not exist,
        # it will be created. Leave name empty to use the default KVS.
        kvs = await context.get_key_value_store()

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Run on

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store, if it does not exist, it will be created.
    # Leave name empty to use the default KVS.
    kvs = await KeyValueStore.open(name='my-key-value-store')

    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture the screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

To see a real-world example of how to get the input from the key-value store, see the Screenshots example.

Crawlee provides the following helper function to simplify interactions with the KeyValueStore:

The get_key_value_store function retrieves the key-value store for the current crawler run. If the KVS does not exist, it will be created. You can also specify the KVS's ID or its name.

Cleaning up the storages

By default, Crawlee cleans up all unnamed storages (including the default one) at the start of each run, so every crawl begins with a clean state. This behavior is controlled by Configuration.purge_on_start (default: True). In contrast, named storages are never purged automatically and persist across runs. The exact behavior may vary depending on the storage client implementation.

When purging happens

The cleanup occurs as soon as a storage is accessed:

When opening a storage explicitly (e.g., RequestQueue.open, Dataset.open, KeyValueStore.open).
When using helper functions that implicitly open storages (e.g., push_data).
Automatically when BasicCrawler.run is invoked.

Disabling automatic purging

To disable automatic purging, set purge_on_start=False in your configuration:

Run on

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Set the purge_on_start field to False to avoid purging the storage on start.
    configuration = Configuration(purge_on_start=False)

    # Pass the configuration to the crawler.
    crawler = HttpCrawler(configuration=configuration)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Manual purging

Purge on start behavior just triggers the storage's purge method, which removes all data from the storage. If you want to purge the storage manually, you can do so by calling the purge method on the storage instance. Or if you want to delete the storage completely, you can call the drop method on the storage instance, which will remove the storage, including metadata and all its data.

Run on

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Create storage client with configuration
    dataset = await Dataset.open(name='my-dataset')

    # Purge the dataset explicitly - purging will remove all items from the dataset.
    # But keeps the dataset itself and its metadata.
    await dataset.purge()

    # Or you can drop the dataset completely, which will remove the dataset
    # and all its items.
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())

Note that purging behavior may vary between storage client implementations. For more details on storage configuration and client implementations, see the Storage clients guide.

Conclusion

This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned about the distinction between named storages (persistent across runs) and unnamed storages with aliases (temporary and purged on start). You discovered how to manage requests using the RequestQueue and store and retrieve scraping results using the Dataset and KeyValueStore. You also learned how to use helper functions to simplify interactions with these storages and how to control storage cleanup behavior.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Overview​

Named and unnamed storages​

Default storage​

Request queue​

Request-related helpers​

Request manager​

Dataset​

Dataset-related helpers​

Key-value store​

Key-value store-related helpers​

Cleaning up the storages​

When purging happens​

Disabling automatic purging​

Manual purging​

Conclusion​

Overview

Named and unnamed storages

Default storage

Request queue

Request-related helpers

Request manager

Dataset

Dataset-related helpers

Key-value store

Key-value store-related helpers

Cleaning up the storages

When purging happens

Disabling automatic purging

Manual purging

Conclusion