Request storage

This guide explains the different types of request storage available in Crawlee, how to store the requests that your crawler will process, and which storage type to choose based on your needs.

Introduction

All request storage types in Crawlee implement the same interface - RequestManager. This unified interface lets you use them in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients - subclasses of BaseStorageClient. For instance, MemoryStorageClient keeps data in memory and can also persist it to a local directory. Data is stored in the following directory structure:

{CRAWLEE_STORAGE_DIR}/{request_provider}/{QUEUE_ID}/
note

The local directory is specified by the CRAWLEE_STORAGE_DIR environment variable, which defaults to ./storage. {QUEUE_ID} is the name or ID of the specific request storage. It defaults to default, unless you override it by setting the CRAWLEE_DEFAULT_REQUEST_QUEUE_ID environment variable.
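
The path layout above can be sketched with a few lines of standard-library Python. This is only an illustration of how the two environment variables combine, not Crawlee's actual resolution code; the helper name resolve_queue_dir and the request_queues folder name used for {request_provider} are assumptions made for this example.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def resolve_queue_dir(
    env: Mapping[str, str] = os.environ,
    request_provider: str = 'request_queues',  # Assumed folder name, for illustration only.
) -> Path:
    """Build {CRAWLEE_STORAGE_DIR}/{request_provider}/{QUEUE_ID} from the environment."""
    storage_dir = env.get('CRAWLEE_STORAGE_DIR', './storage')
    queue_id = env.get('CRAWLEE_DEFAULT_REQUEST_QUEUE_ID', 'default')
    return Path(storage_dir) / request_provider / queue_id


# With no variables set, both defaults apply.
print(resolve_queue_dir({}))

# Both variables overridden.
print(resolve_queue_dir({'CRAWLEE_STORAGE_DIR': '/tmp/crawl', 'CRAWLEE_DEFAULT_REQUEST_QUEUE_ID': 'my-queue'}))
```

Passing the environment as a mapping keeps the helper easy to try out without touching the real process environment.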

Request queue

The RequestQueue is the primary storage for URLs in Crawlee and is especially useful for deep crawling. It supports dynamic addition and removal of URLs, which makes it ideal for recursive tasks where URLs are discovered and added during the crawl (e.g., following links across multiple pages). Each Crawlee project has a default request queue, which can be used to store URLs during a specific run. This makes the RequestQueue well suited to large-scale and complex crawls.

The following code demonstrates the usage of the RequestQueue:

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue. If it does not exist, it will be created.
    # Leave name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Add a single request.
    await request_queue.add_request('https://apify.com/')

    # Add multiple requests as a batch.
    await request_queue.add_requests_batched(['https://crawlee.dev/', 'https://crawlee.dev/python/'])

    # Fetch and process requests from the queue.
    while request := await request_queue.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_queue.mark_request_as_handled(request)

    # Remove the request queue.
    await request_queue.drop()


if __name__ == '__main__':
    asyncio.run(main())
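
To see why a queue suits recursive crawls, the following standard-library sketch mimics the fetch, process, and mark-as-handled cycle on a made-up link graph. It is a toy model, not Crawlee's implementation: the LINKS mapping, the crawl function, and the deduplication by URL are assumptions made for this illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages and the links found on them.
LINKS = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': ['https://example.com/'],
}


def crawl(start_url: str) -> list[str]:
    """Process URLs in FIFO order, enqueuing newly discovered links as we go."""
    pending = deque([start_url])  # Requests waiting to be fetched.
    seen = {start_url}  # Every URL ever enqueued, so each one is added only once.
    handled = []  # Requests that were processed and marked as handled.

    while pending:
        url = pending.popleft()  # Fetch the next request.
        for link in LINKS.get(url, []):  # "Process" the page: discover its links.
            if link not in seen:
                seen.add(link)
                pending.append(link)  # Dynamically add a request during the crawl.
        handled.append(url)  # Mark the request as handled.

    return handled


print(crawl('https://example.com/'))
```

The loop ends once no pending requests remain, even though the link graph contains a cycle, because already seen URLs are never enqueued again.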

Request list

The RequestList is a simpler, lightweight storage option for cases where all URLs to be crawled are known upfront. It represents a list of URLs held in the memory of the crawler run (or, optionally, in the default KeyValueStore associated with the run, if specified). Use it to crawl a large number of URLs when every URL to visit is known in advance and none will be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. A RequestList is typically created for a single crawler run, and its use must be specified explicitly.

warning

The RequestList class is at an early stage and is not fully implemented. It is currently intended mainly for testing and small-scale projects. The current implementation is in-memory only and very limited. It will be (re)implemented in the future. For more details, see the GitHub issue crawlee-python#99. For production use, we recommend the RequestQueue.

The following code demonstrates the usage of the RequestList:

import asyncio

from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create the request list from a static set of URLs.
    # The list lives in memory only, for the duration of this run.
    request_list = RequestList(
        name='my-request-list',
        requests=['https://apify.com/', 'https://crawlee.dev/', 'https://crawlee.dev/python/'],
    )

    # Fetch and process requests from the list.
    while request := await request_list.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
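
When the URLs come from a text file, they first have to be extracted into a plain list before a RequestList can be built from them. The sketch below shows one way to do that with the standard library only; parse_urls and its regular expression are hypothetical helpers for this example, not part of Crawlee, and the sample text stands in for a downloaded file.

```python
import re

# Simplistic pattern: an http(s) scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')


def parse_urls(text: str) -> list[str]:
    """Return the URLs found in the text, in order of appearance, without duplicates."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0)
        if url not in urls:
            urls.append(url)
    return urls


# Sample content standing in for a text file fetched from the web.
sample = """
https://crawlee.dev/
https://apify.com/
# a comment line without any link
https://crawlee.dev/
"""

urls = parse_urls(sample)
print(urls)
```

The resulting list could then be passed as the requests argument, as in the RequestList example above.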

Processing requests from multiple sources

In some cases, you might need to combine requests from multiple sources, most frequently from a static list of URLs (such as RequestList) and a RequestQueue, where the queue takes care of persistence and retrying failed requests.

This use case is supported via the RequestManagerTandem class. You may also use the RequestLoader.to_tandem method as a shortcut.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.request_loaders import RequestList


async def main() -> None:
    # Create a static request list.
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    crawler = ParselCrawler(
        # Requests from the list are processed first, but each one is moved
        # into the default request queue before it is handled.
        request_manager=await request_list.to_tandem(),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.enqueue_links()  # New links are enqueued directly to the queue.

    await crawler.run()


asyncio.run(main())
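
The ordering described in the code comment can be modelled in a few lines. The following is a simplified, standard-library sketch of the tandem idea, not the RequestManagerTandem source: the ToyTandem class and its methods are hypothetical, and it assumes that each request from the read-only list is moved to the front of the queue right before it is fetched.

```python
from __future__ import annotations

from collections import deque


class ToyTandem:
    """Toy model: a read-only list of requests combined with a writable queue."""

    def __init__(self, request_list: list[str]) -> None:
        self._list = deque(request_list)  # Static source, only ever read from.
        self._queue: deque[str] = deque()  # Dynamic source, accepts new requests.

    def add_request(self, url: str) -> None:
        """New requests (e.g. discovered links) always go to the queue."""
        self._queue.append(url)

    def fetch_next_request(self) -> str | None:
        """Move the next list item to the front of the queue, then fetch from the queue."""
        if self._list:
            self._queue.appendleft(self._list.popleft())
        return self._queue.popleft() if self._queue else None


tandem = ToyTandem(['https://crawlee.dev', 'https://apify.com'])
order = []
while (url := tandem.fetch_next_request()) is not None:
    order.append(url)
    if url == 'https://crawlee.dev':
        # Simulates a link discovered while handling the first page.
        tandem.add_request('https://crawlee.dev/python/')

print(order)
```

In this model every request passes through the queue, so the queue remains the single place that would track handling and retries, while the list items still come out ahead of links discovered during the run.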

We offer several helper functions to simplify interactions with request storages:

  • The add_requests function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs you want to be added to the request storage. If you need to specify further details of the request, such as a label or user_data, you have to pass instances of the Request class to the helper.
  • The enqueue_links function is designed to discover new URLs on the current page and add them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See the Crawl website with relative links example for more details.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.add_requests(['https://apify.com/'])

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
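
The include/exclude filtering mentioned for enqueue_links can be pictured as follows. This standard-library sketch only illustrates the idea with glob patterns; the filter_links helper and its matching rules are assumptions for this example and do not reproduce how Crawlee evaluates its filters.

```python
from fnmatch import fnmatchcase


def filter_links(links: list[str], include: list[str], exclude: list[str]) -> list[str]:
    """Keep links that match at least one include glob and no exclude glob."""
    return [
        link
        for link in links
        if any(fnmatchcase(link, pattern) for pattern in include)
        and not any(fnmatchcase(link, pattern) for pattern in exclude)
    ]


# Links as they might be discovered on a page.
discovered = [
    'https://crawlee.dev/python/docs/quick-start',
    'https://crawlee.dev/blog/some-post',
    'https://apify.com/',
]

selected = filter_links(
    discovered,
    include=['https://crawlee.dev/*'],
    exclude=['*/blog/*'],
)
print(selected)
```

Only the links that survive both checks would end up in the request storage.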

Cleaning up the storages

Default storages are purged before the crawler starts, unless explicitly configured otherwise (see Configuration.purge_on_start). This cleanup happens as soon as a storage is accessed, either when you open a storage (e.g. using RequestQueue.open) or when you interact with a storage through one of the helper functions (e.g. add_requests or enqueue_links, which implicitly open the request storage). To keep the data from a previous run, set purge_on_start to False:

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    config = Configuration(purge_on_start=False)
    crawler = HttpCrawler(configuration=config)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

If you do not explicitly interact with storages in your code, the purging will occur automatically when the BasicCrawler.run method is invoked.

If you need to purge storages earlier, you can call MemoryStorageClient.purge_on_start directly. This method triggers the purging process for the underlying storage implementation you are currently using.

import asyncio

from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient.from_config()
    await storage_client.purge_on_start()


if __name__ == '__main__':
    asyncio.run(main())
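
To make the effect of purging concrete, here is a standard-library sketch that clears a default storage folder inside a temporary directory. It is an illustration only and does not call Crawlee: the purge_default_queue helper, the request_queues folder name, and the assumption that a named queue is left untouched are all made up for this example.

```python
import shutil
import tempfile
from pathlib import Path


def purge_default_queue(storage_dir: Path, purge_on_start: bool = True) -> None:
    """Delete the default queue folder, unless purging is switched off."""
    default_dir = storage_dir / 'request_queues' / 'default'  # Assumed layout.
    if purge_on_start and default_dir.exists():
        shutil.rmtree(default_dir)


with tempfile.TemporaryDirectory() as tmp:
    storage_dir = Path(tmp)
    default_file = storage_dir / 'request_queues' / 'default' / 'request.json'
    named_file = storage_dir / 'request_queues' / 'my-request-queue' / 'request.json'

    # Simulate leftovers from a previous run in both a default and a named queue.
    for path in (default_file, named_file):
        path.parent.mkdir(parents=True)
        path.write_text('{}')

    # With purging disabled, the default queue survives.
    purge_default_queue(storage_dir, purge_on_start=False)
    kept_when_disabled = default_file.exists()

    # With purging enabled, only the default queue is removed.
    purge_default_queue(storage_dir)
    default_left = default_file.exists()
    named_left = named_file.exists()

print(kept_when_disabled, default_left, named_left)
```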