Request storage

This guide explains the different types of request storage available in Crawlee, how to store the requests that your crawler will process, and which storage type to choose based on your needs.

Introduction

All request storage types in Crawlee implement the same interface, RequestManager. This unified interface lets you use them in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients, which are subclasses of BaseStorageClient. For instance, MemoryStorageClient keeps data in memory and can also persist it to a local directory. Data is stored in the following directory structure:

{CRAWLEE_STORAGE_DIR}/{request_provider}/{QUEUE_ID}/

note

The local directory is specified by the CRAWLEE_STORAGE_DIR environment variable, which defaults to ./storage. {QUEUE_ID} is the name or ID of the specific request storage. It defaults to default unless you override it by setting the CRAWLEE_DEFAULT_REQUEST_QUEUE_ID environment variable.
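
To make the path rule above concrete, here is a minimal sketch that resolves the expected directory from the two environment variables. It uses only the standard library and does not call Crawlee. The function name resolve_queue_dir and the folder name request_queues (standing in for {request_provider}) are assumptions made for illustration.

```python
from __future__ import annotations

import os
from pathlib import Path


def resolve_queue_dir(queue_id: str | None = None, provider_dir: str = 'request_queues') -> Path:
    """Build the expected local directory for a request queue.

    `provider_dir` is an assumed folder name standing in for {request_provider}.
    """
    # Fall back to the documented defaults when the variables are not set.
    storage_dir = os.environ.get('CRAWLEE_STORAGE_DIR', './storage')
    default_id = os.environ.get('CRAWLEE_DEFAULT_REQUEST_QUEUE_ID', 'default')
    return Path(storage_dir) / provider_dir / (queue_id or default_id)


if __name__ == '__main__':
    print(resolve_queue_dir())
    print(resolve_queue_dir('my-request-queue'))
```

Passing an explicit queue_id takes precedence over the environment default, which mirrors how a named storage differs from the default one.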

Request queue

The RequestQueue is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a default request queue, which can be used to store URLs during a specific run. The RequestQueue is highly useful for large-scale and complex crawls.

The following code demonstrates the usage of the RequestQueue:

Basic usage
Usage with Crawler
Explicit usage with Crawler

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue. It is created if it does not exist yet.
    # Leave name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Add a single request.
    await request_queue.add_request('https://apify.com/')

    # Add multiple requests as a batch.
    await request_queue.add_requests_batched(['https://crawlee.dev/', 'https://crawlee.dev/python/'])

    # Fetch and process requests from the queue.
    while request := await request_queue.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_queue.mark_request_as_handled(request)

    # Remove the request queue.
    await request_queue.drop()


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler). The request queue is the
    # default request provider. If none is specified, the crawler opens and fully manages it.
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Use context's add_requests method helper to add new requests from the handler.
        await context.add_requests(['https://crawlee.dev/python/'])

    # Use crawler's add_requests method helper to add new requests.
    await crawler.add_requests(['https://apify.com/'])

    # Run the crawler. You can optionally pass the list of initial requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue. It is created if it does not exist yet.
    # Leave name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Interact with the request queue directly, e.g. add a batch of requests.
    await request_queue.add_requests_batched(['https://apify.com/', 'https://crawlee.dev/'])

    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request
    # queue to it as the request provider. It will be managed by the crawler.
    crawler = HttpCrawler(request_manager=request_queue)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # And execute the crawler.
    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
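
To illustrate why a queue suits recursive crawls, the following is a toy sketch of the fetch, discover, enqueue, and mark-as-handled cycle. It uses a plain in-memory deque rather than RequestQueue, and the LINKS graph is hypothetical data that replaces real page fetching. Skipping URLs that were already enqueued is an assumption of this sketch, not a statement about how Crawlee deduplicates requests.

```python
from __future__ import annotations

from collections import deque

# Hypothetical link graph standing in for real pages; nothing is fetched from the web.
LINKS = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b': ['https://example.com/'],
    'https://example.com/c': [],
}


def crawl(start_url: str) -> list[str]:
    """Visit every URL reachable from start_url, each exactly once."""
    pending = deque([start_url])  # requests waiting to be processed
    seen = {start_url}  # every URL ever enqueued, used to skip duplicates
    handled = []  # URLs that were processed, in order

    while pending:
        url = pending.popleft()  # fetch the next request
        for link in LINKS.get(url, []):  # "discover" links on the page
            if link not in seen:
                seen.add(link)
                pending.append(link)  # enqueue while the crawl is running
        handled.append(url)  # mark the request as handled
    return handled


if __name__ == '__main__':
    for visited in crawl('https://example.com/'):
        print(visited)
```

The loop ends when no pending requests remain, even though the total number of URLs was not known when the crawl started.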

Request list

The RequestList is a simpler, lightweight storage option for cases where all URLs to crawl are known upfront. It represents a list of URLs held in the memory of the crawler run (or optionally in the default KeyValueStore associated with the run, if specified). Use it to crawl a large number of URLs when you know every URL the crawler should visit and no URLs will be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. The RequestList is typically created exclusively for a single crawler run, and its usage must be explicitly specified.

warning

The RequestList class is in an early version and is not fully implemented. It is currently intended mainly for testing purposes and small-scale projects. The current implementation is in-memory storage only and is very limited. It will be (re)implemented in the future. For more details, see the GitHub issue crawlee-python#99. For production usage we recommend using the RequestQueue.

The following code demonstrates the usage of the RequestList:

Basic usage
Usage with Crawler

import asyncio

from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create the request list. It is held in memory for this run only.
    # The name identifies the list; the requests are the URLs known upfront.
    request_list = RequestList(
        name='my-request-list',
        requests=['https://apify.com/', 'https://crawlee.dev/', 'https://crawlee.dev/python/'],
    )

    # Fetch and process requests from the list.
    while request := await request_list.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create the request list. It is held in memory for this run only.
    # The name identifies the list; the requests are the URLs known upfront.
    request_list = RequestList(
        name='my-request-list',
        requests=['https://apify.com/', 'https://crawlee.dev/'],
    )

    # Join the request list into a tandem with the default request queue
    request_manager = await request_list.to_tandem()

    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request manager tandem
    crawler = HttpCrawler(request_manager=request_manager)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Use context's add_requests method helper to add new requests from the handler.
        await context.add_requests(['https://crawlee.dev/python/docs/quick-start'])

    # Use crawler's add_requests method helper to add new requests.
    await crawler.add_requests(['https://crawlee.dev/python/api'])

    # Run the crawler. You can optionally pass the list of initial requests.
    await crawler.run(['https://crawlee.dev/python/'])


if __name__ == '__main__':
    asyncio.run(main())
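
The section above mentions that URLs can be parsed from a text file. As a rough sketch of that step, the helper below extracts http(s) URLs from text that has already been loaded. The name extract_urls and the regular expression are assumptions for illustration, not part of Crawlee, and downloading the file is left out. The resulting list could then be passed as the requests argument shown in the examples above.

```python
from __future__ import annotations

import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')


def extract_urls(text: str) -> list[str]:
    """Return the unique http(s) URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip('.,;)')  # drop trailing punctuation picked up from prose
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == '__main__':
    sample = 'https://crawlee.dev/\nsee https://apify.com/, and https://crawlee.dev/ again'
    print(extract_urls(sample))
```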

Processing requests from multiple sources

In some cases, you might need to combine requests from multiple sources, most frequently from a static list of URLs (such as RequestList) and a RequestQueue, where the queue takes care of persistence and retrying failed requests.

This use case is supported via the RequestManagerTandem class. You may also use the RequestLoader.to_tandem method as a shortcut.

Using to_tandem helper
Explicitly using RequestManagerTandem

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create a static request list
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    crawler = ParselCrawler(
        # Requests from the list are processed first; each one is moved into the default request queue before it is processed.
        request_manager=await request_list.to_tandem(),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.enqueue_links()  # New links will be enqueued directly to the queue

    await crawler.run()


asyncio.run(main())

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList, RequestManagerTandem
from crawlee.storages import RequestQueue


async def main() -> None:
    # Create a static request list
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    # Open the default request queue
    request_queue = await RequestQueue.open()

    crawler = ParselCrawler(
        # Requests from the list are processed first; each one is moved into the default request queue before it is processed.
        request_manager=RequestManagerTandem(request_list, request_queue),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.enqueue_links()  # New links will be enqueued directly to the queue

    await crawler.run()


asyncio.run(main())
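
The ordering described in the code comments above can be sketched with a toy model. The classes StaticList, Queue, and Tandem below are hypothetical, in-memory stand-ins and not the Crawlee implementation. The sketch assumes that each request from the static list is placed at the front of the queue just before it is fetched, which is one way to get the "list first, but through the queue" behavior.

```python
from collections import deque


class StaticList:
    """Read-only source of URLs that are known upfront."""

    def __init__(self, urls):
        self._urls = deque(urls)

    def fetch_next(self):
        return self._urls.popleft() if self._urls else None


class Queue:
    """Mutable queue that accepts new URLs during the run."""

    def __init__(self):
        self._pending = deque()

    def add(self, url, forefront=False):
        if forefront:
            self._pending.appendleft(url)
        else:
            self._pending.append(url)

    def fetch_next(self):
        return self._pending.popleft() if self._pending else None


class Tandem:
    """Combine a static list with a queue behind one fetch/add interface."""

    def __init__(self, loader, queue):
        self._loader = loader
        self._queue = queue

    def add(self, url):
        # Newly discovered URLs go directly to the queue.
        self._queue.add(url)

    def fetch_next(self):
        # Move the next static URL to the front of the queue, then read from the queue.
        url = self._loader.fetch_next()
        if url is not None:
            self._queue.add(url, forefront=True)
        return self._queue.fetch_next()


if __name__ == '__main__':
    tandem = Tandem(StaticList(['https://crawlee.dev', 'https://apify.com']), Queue())
    tandem.add('https://crawlee.dev/python/')
    while (url := tandem.fetch_next()) is not None:
        print(url)
```

In this model the queue is the only component that ever hands out work, so anything it provides (such as persistence or retries in a real queue) applies to the static URLs as well.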

Request-related helpers

We offer several helper functions to simplify interactions with request storages:

The add_requests function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs you want to be added to the request storage. If you need to specify further details of the request, such as a label or user_data, you have to pass instances of the Request class to the helper.
The enqueue_links function discovers new URLs on the current page and adds them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See the Crawl website with relative links example for more details.

Add requests
Enqueue links

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.add_requests(['https://apify.com/'])

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
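
To show what include/exclude filters and an enqueue strategy do conceptually, here is a standalone sketch that decides which discovered links would be kept. It is not the enqueue_links implementation: the function select_links, its parameter names, the glob-style patterns, and the same-hostname rule are all assumptions chosen for illustration.

```python
from fnmatch import fnmatch
from urllib.parse import urljoin, urlparse


def select_links(page_url, hrefs, include=None, exclude=None, same_hostname=True):
    """Return absolute URLs from hrefs that pass the strategy and the filters."""
    page_host = urlparse(page_url).hostname
    selected = []
    for href in hrefs:
        url = urljoin(page_url, href)  # resolve relative links against the page URL
        if same_hostname and urlparse(url).hostname != page_host:
            continue  # strategy: stay on the same hostname
        if include and not any(fnmatch(url, pattern) for pattern in include):
            continue  # must match at least one include pattern
        if exclude and any(fnmatch(url, pattern) for pattern in exclude):
            continue  # must not match any exclude pattern
        selected.append(url)
    return selected


if __name__ == '__main__':
    links = ['docs/quick-start', '/python/api', 'https://apify.com/', '/blog/post']
    print(select_links('https://crawlee.dev/python/', links))
    print(
        select_links(
            'https://crawlee.dev/python/',
            links,
            include=['https://crawlee.dev/python/*'],
            exclude=['*/api'],
        )
    )
```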

Cleaning up the storages

Default storages are purged before the crawler starts, unless explicitly configured otherwise. To change this behavior, see Configuration.purge_on_start. The cleanup happens as soon as a storage is accessed, either when you open a storage (e.g. using RequestQueue.open) or when you interact with a storage through one of the helper functions (e.g. add_requests or enqueue_links, which implicitly open the request storage).

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    config = Configuration(purge_on_start=False)
    crawler = HttpCrawler(configuration=config)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

If you do not explicitly interact with storages in your code, the purging will occur automatically when the BasicCrawler.run method is invoked.

If you need to purge storages earlier, you can call MemoryStorageClient.purge_on_start directly. This method triggers the purging process for the underlying storage implementation you are currently using.

import asyncio

from crawlee.storage_clients import MemoryStorageClient

async def main() -> None:
    storage_client = MemoryStorageClient.from_config()
    await storage_client.purge_on_start()

if __name__ == '__main__':
    asyncio.run(main())
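
The purge-on-first-access behavior described in this section can be pictured with a small toy model. The PurgingStore class below is hypothetical and keeps everything in memory; it only illustrates the timing of the cleanup and the effect of a purge_on_start flag, and says nothing about how Crawlee's storage clients are implemented.

```python
class PurgingStore:
    """Toy storage that clears leftover items the first time it is accessed."""

    def __init__(self, leftover, purge_on_start=True):
        self._items = list(leftover)  # data left over from a previous run
        self._purge_on_start = purge_on_start
        self._started = False

    def _ensure_started(self):
        # Runs once, on the first access of any kind.
        if not self._started:
            if self._purge_on_start:
                self._items.clear()
            self._started = True

    def add(self, item):
        self._ensure_started()
        self._items.append(item)

    def items(self):
        self._ensure_started()
        return list(self._items)


if __name__ == '__main__':
    purged = PurgingStore(['https://old.example/'])
    purged.add('https://crawlee.dev/')
    print(purged.items())

    kept = PurgingStore(['https://old.example/'], purge_on_start=False)
    kept.add('https://crawlee.dev/')
    print(kept.items())
```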
