# Storages

Crawlee offers multiple storage types for managing and persisting your crawling data. Request-oriented storages, such as the `RequestQueue`, help you store and deduplicate URLs, while result-oriented storages, like `Dataset` and `KeyValueStore`, focus on storing and retrieving scraping results. This guide helps you choose the storage type that suits your needs.
## Storage clients

Storage clients in Crawlee are subclasses of `BaseStorageClient`. They handle interactions with different storage backends. For instance:

- `MemoryStorageClient`: Stores data in memory and persists it to the local file system.
- `ApifyStorageClient`: Manages storage on the Apify Platform. It is implemented in the Apify SDK.

Each storage client is responsible for maintaining the storages in a specific environment. This abstraction makes it easier to switch between different environments, e.g. between a local development and a cloud production setup.
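The idea behind this abstraction can be sketched with a minimal interface. The classes below are hypothetical and much simpler than Crawlee's actual `BaseStorageClient`; they only illustrate why coding against an interface makes backends swappable:

```python
from abc import ABC, abstractmethod


class StorageClient(ABC):
    """Hypothetical, simplified storage-client interface (not Crawlee's actual API)."""

    @abstractmethod
    def set_record(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get_record(self, key: str) -> 'str | None': ...


class InMemoryClient(StorageClient):
    """A backend that keeps records in a plain dict."""

    def __init__(self) -> None:
        self._records = {}

    def set_record(self, key: str, value: str) -> None:
        self._records[key] = value

    def get_record(self, key: str) -> 'str | None':
        return self._records.get(key)


# Code written against the interface works with any backend,
# which is what makes switching environments easy.
client: StorageClient = InMemoryClient()
client.set_record('greeting', 'hello')
print(client.get_record('greeting'))  # hello
```

A cloud-backed client would implement the same two methods against a remote API, and the calling code would not change.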
## Memory storage client

The `MemoryStorageClient` is the default and currently the only storage client in Crawlee. It stores data in memory and persists it to the local file system, using the following directory structure:

```text
{CRAWLEE_STORAGE_DIR}/{storage_type}/{STORAGE_ID}/
```

where:

- `{CRAWLEE_STORAGE_DIR}`: The root directory for local storage, specified by the `CRAWLEE_STORAGE_DIR` environment variable (default: `./storage`).
- `{storage_type}`: The type of storage (e.g., `datasets`, `key_value_stores`, `request_queues`).
- `{STORAGE_ID}`: The ID of the specific storage instance (default: `default`).
The current `MemoryStorageClient` and its interface are quite old and have known shortcomings. We plan to refactor it, together with the whole `BaseStorageClient` interface, in the near future to make it better and easier to use. We also plan to introduce new storage clients for other storage backends, e.g. SQLite.

You can override the default storage IDs using these environment variables: `CRAWLEE_DEFAULT_DATASET_ID`, `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`, or `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`.
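To illustrate how these settings combine, here is a small stdlib-only sketch; the `resolve_storage_path` helper is hypothetical (not part of Crawlee) and only mirrors the documented layout:

```python
import os


def resolve_storage_path(storage_type: str, storage_id: str = None) -> str:
    # Hypothetical helper mirroring the documented layout:
    # {CRAWLEE_STORAGE_DIR}/{storage_type}/{STORAGE_ID}/
    root = os.environ.get('CRAWLEE_STORAGE_DIR', './storage')
    if storage_id is None:
        # Default IDs can be overridden via e.g. CRAWLEE_DEFAULT_DATASET_ID.
        env_var = f'CRAWLEE_DEFAULT_{storage_type.rstrip("s").upper()}_ID'
        storage_id = os.environ.get(env_var, 'default')
    return os.path.join(root, storage_type, storage_id)


# With no environment variables set, this resolves to ./storage/datasets/default
print(resolve_storage_path('datasets'))
```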
## Request queue

The `RequestQueue` is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a default request queue, which can be used to store URLs during a specific run. The `RequestQueue` is highly useful for large-scale and complex crawls.
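The deduplication behavior described above can be sketched with a toy stand-in. This is purely illustrative; Crawlee's real `RequestQueue` also handles persistence, retries, and concurrent consumers:

```python
from __future__ import annotations

from collections import deque


class TinyRequestQueue:
    """Illustrative stand-in (not Crawlee's implementation) showing the
    deduplication a request queue provides."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._pending: deque[str] = deque()

    def add_request(self, url: str) -> bool:
        # Duplicate URLs are silently skipped.
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def fetch_next_request(self) -> str | None:
        return self._pending.popleft() if self._pending else None


q = TinyRequestQueue()
q.add_request('https://crawlee.dev/')
q.add_request('https://crawlee.dev/')  # duplicate, skipped
print(q.fetch_next_request())  # https://crawlee.dev/
```

Because duplicates are dropped at insertion time, handlers can enqueue every link they discover without worrying about crawling the same page twice.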
By default, data are stored using the following path structure:

```text
{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/{INDEX}.json
```

- `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, specified by the environment variable.
- `{QUEUE_ID}`: The ID of the request queue, `default` by default.
- `{INDEX}`: The zero-based index of the record within the queue.
The following code demonstrates the usage of the `RequestQueue`:
**Basic usage**

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue; if it does not exist, it will be created.
    # Leave the name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Add a single request.
    await request_queue.add_request('https://apify.com/')

    # Add multiple requests as a batch.
    await request_queue.add_requests_batched(['https://crawlee.dev/', 'https://crawlee.dev/python/'])

    # Fetch and process requests from the queue.
    while request := await request_queue.fetch_next_request():
        # Do something with it...

        # And mark it as handled.
        await request_queue.mark_request_as_handled(request)

    # Remove the request queue.
    await request_queue.drop()


if __name__ == '__main__':
    asyncio.run(main())
```

**Usage with Crawler**

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler). The request queue is
    # the default request provider; it will be opened and fully managed if not specified.
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Use the context's add_requests helper to add new requests from the handler.
        await context.add_requests(['https://crawlee.dev/python/'])

    # Use the crawler's add_requests helper to add new requests.
    await crawler.add_requests(['https://apify.com/'])

    # Run the crawler. You can optionally pass a list of initial requests.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

**Explicit usage with Crawler**

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the request queue; if it does not exist, it will be created.
    # Leave the name empty to use the default request queue.
    request_queue = await RequestQueue.open(name='my-request-queue')

    # Interact with the request queue directly, e.g. add a batch of requests.
    await request_queue.add_requests_batched(['https://apify.com/', 'https://crawlee.dev/'])

    # Create a new crawler (it can be any subclass of BasicCrawler) and pass the request
    # queue to it as the request manager. It will be managed by the crawler.
    crawler = HttpCrawler(request_manager=request_queue)

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # And execute the crawler.
    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```
## Request-related helpers

Crawlee provides helper functions to simplify interactions with the `RequestQueue`:

- The `add_requests` function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs you want to be added to the request storage. If you need to specify further details of the request, such as a `label` or `user_data`, you have to pass instances of the `Request` class to the helper.
- The `enqueue_links` function is designed to discover new URLs on the current page and add them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See the Crawl website with relative links example for more details.
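Conceptually, link discovery works like the stdlib-only sketch below: find anchor elements, resolve relative URLs against the page URL, and collect the results. This is only an illustration of the idea; Crawlee's real `enqueue_links` adds selectors, enqueue strategies, and filters on top:

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> elements."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Relative links are resolved against the page URL.
                    self.links.append(urljoin(self.base_url, value))


html = '<a href="/python/">Docs</a> <a href="https://apify.com/">Apify</a>'
extractor = LinkExtractor('https://crawlee.dev/')
extractor.feed(html)
print(extractor.links)  # ['https://crawlee.dev/python/', 'https://apify.com/']
```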
**Add requests**

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.add_requests(['https://apify.com/'])

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

**Enqueue links**

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
## Request manager

The `RequestQueue` implements the `RequestManager` interface, offering a unified API for interacting with various request storage types.

If you need custom functionality, you can create your own request storage by subclassing the `RequestManager` class and implementing its required methods.

For a detailed explanation of the `RequestManager` and other related components, refer to the Request loaders guide.
## Dataset

The `Dataset` is designed for storing structured data, where each entry has a consistent set of attributes, such as products in an online store or real estate listings. Think of a `Dataset` as a table: each entry corresponds to a row, with attributes represented as columns. Datasets are append-only, allowing you to add new records but not modify or delete existing ones. Every Crawlee project run is associated with a default dataset, typically used to store results specific to that crawler execution. However, using this dataset is optional.

By default, data are stored using the following path structure:

```text
{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
```

- `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, specified by the environment variable.
- `{DATASET_ID}`: The dataset's ID, `default` by default.
- `{INDEX}`: The zero-based index of the record within the dataset.
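The append-only, one-file-per-record layout can be mimicked with a few lines of stdlib code. The `push_record` helper is illustrative only (the real storage client also caches, batches writes, and handles concurrency):

```python
import json
import tempfile
from pathlib import Path


def push_record(dataset_dir: Path, record: dict) -> Path:
    # Illustrative: store each record as a zero-padded {INDEX}.json file,
    # only ever appending, never rewriting earlier files.
    dataset_dir.mkdir(parents=True, exist_ok=True)
    index = len(list(dataset_dir.glob('*.json')))
    path = dataset_dir / f'{index:09d}.json'
    path.write_text(json.dumps(record))
    return path


with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / 'datasets' / 'default'
    first = push_record(dataset_dir, {'foo': 'bar'})
    second = push_record(dataset_dir, {'col3': 123})
    print(first.name, second.name)  # 000000000.json 000000001.json
```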
The following code demonstrates basic operations of the dataset:
**Basic usage**

```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset; if it does not exist, it will be created.
    # Leave the name empty to use the default dataset.
    dataset = await Dataset.open()

    # Push a single row of data.
    await dataset.push_data({'foo': 'bar'})

    # Push multiple rows of data (anything JSON-serializable can be pushed).
    await dataset.push_data([{'foo': 'bar2', 'col2': 'val2'}, {'col3': 123}])

    # Fetch all data from the dataset.
    data = await dataset.get_data()
    # Do something with it...

    # Remove the dataset.
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())
```

**Usage with Crawler**

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the (default) dataset.
        await context.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to a file.
    await crawler.export_data(path='dataset.csv')


if __name__ == '__main__':
    asyncio.run(main())
```

**Explicit usage with Crawler**

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset; if it does not exist, it will be created.
    # Leave the name empty to use the default dataset.
    dataset = await Dataset.open()

    # Create a new crawler (it can be any subclass of BasicCrawler).
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the dataset.
        await dataset.push_data(data)

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the dataset to the key-value store.
    await dataset.export_to(key='dataset', content_type='csv')


if __name__ == '__main__':
    asyncio.run(main())
```
## Dataset-related helpers

Crawlee provides the following helper function to simplify interactions with the `Dataset`:

- The `push_data` function allows you to manually add data to the dataset. You can optionally specify the dataset ID or its name.
## Key-value store

The `KeyValueStore` is designed to save and retrieve data records or files efficiently. Each record is uniquely identified by a key and is associated with a specific MIME type, making the `KeyValueStore` ideal for tasks like saving web page screenshots, PDFs, or tracking the state of crawlers.

By default, data are stored using the following path structure:

```text
{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
```

- `{CRAWLEE_STORAGE_DIR}`: The root directory for all storage data, specified by the environment variable.
- `{STORE_ID}`: The key-value store's ID, `default` by default.
- `{KEY}`: The unique key for the record.
- `{EXT}`: The file extension corresponding to the MIME type of the content.
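The `{EXT}` part follows from the record's MIME type. The mapping can be sketched with the stdlib `mimetypes` module; `record_filename` is an illustrative helper, a simplification of what the storage client does internally:

```python
import mimetypes


def record_filename(key: str, content_type: str) -> str:
    # Illustrative: derive the on-disk filename {KEY}.{EXT} from the
    # record's key and MIME type, falling back to a generic extension.
    ext = mimetypes.guess_extension(content_type) or '.bin'
    return f'{key}{ext}'


print(record_filename('screenshot-home', 'image/png'))    # screenshot-home.png
print(record_filename('state', 'application/json'))       # state.json
```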
The following code demonstrates the usage of the `KeyValueStore`:
**Basic usage**

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store; if it does not exist, it will be created.
    # Leave the name empty to use the default KVS.
    kvs = await KeyValueStore.open()

    # Set a value associated with 'some-key'.
    await kvs.set_value(key='some-key', value={'foo': 'bar'})

    # Get the value associated with 'some-key'.
    value = await kvs.get_value('some-key')
    # Do something with it...

    # Delete the value associated with 'some-key' by setting it to None.
    await kvs.set_value(key='some-key', value=None)

    # Remove the key-value store.
    await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())
```

**Usage with Crawler**

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture a screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Get the key-value store from the context. If it does not exist,
        # it will be created. Leave the name empty to use the default KVS.
        kvs = await context.get_key_value_store()

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

**Explicit usage with Crawler**

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open the key-value store; if it does not exist, it will be created.
    # Leave the name empty to use the default KVS.
    kvs = await KeyValueStore.open()

    # Create a new Playwright crawler.
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Capture a screenshot of the page using Playwright's API.
        screenshot = await context.page.screenshot()
        name = context.request.url.split('/')[-1]

        # Store the screenshot in the key-value store.
        await kvs.set_value(
            key=f'screenshot-{name}',
            value=screenshot,
            content_type='image/png',
        )

    # Run the crawler with the initial URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
To see a real-world example of how to get the input from the key-value store, see the Screenshots example.
## Key-value store-related helpers

Crawlee provides the following helper function to simplify interactions with the `KeyValueStore`:

- The `get_key_value_store` function retrieves the key-value store for the current crawler run. If the KVS does not exist, it will be created. You can also specify the KVS's ID or its name.
## Cleaning up the storages

Default storages are purged before the crawler starts, unless explicitly configured otherwise; see `Configuration.purge_on_start` for details. This cleanup happens as soon as a storage is accessed: either when you open a storage (e.g. using `RequestQueue.open`, `Dataset.open`, `KeyValueStore.open`) or when you interact with a storage through one of the helper functions (e.g. `push_data`), which implicitly opens the result storage.
```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Set the purge_on_start field to False to avoid purging the storage on start.
    configuration = Configuration(purge_on_start=False)

    # Pass the configuration to the crawler.
    crawler = HttpCrawler(configuration=configuration)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
If you do not explicitly interact with storages in your code, the purging will occur automatically when the `BasicCrawler.run` method is invoked.

If you need to purge storages earlier, you can call `MemoryStorageClient.purge_on_start` directly if you are using the default storage client. This method triggers the purging process for the underlying storage implementation you are currently using.
```python
import asyncio

from crawlee.crawlers import HttpCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient.from_config()

    # Call the purge_on_start method to explicitly purge the storage.
    await storage_client.purge_on_start()

    # Pass the storage client to the crawler.
    crawler = HttpCrawler(storage_client=storage_client)

    # ...


if __name__ == '__main__':
    asyncio.run(main())
```
## Conclusion

This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned how to manage requests and store and retrieve scraping results using the `RequestQueue`, `Dataset`, and `KeyValueStore`. You also discovered how to use helper functions to simplify interactions with these storages. Finally, you learned how to clean up storages before starting a crawler run and how to purge them explicitly. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!