Storage clients
Storage clients in Crawlee are subclasses of StorageClient. They handle interactions with different storage backends. For instance:
- MemoryStorageClient: Stores data purely in memory with no persistence.
- FileSystemStorageClient: Provides persistent file system storage with in-memory caching for better performance.
- ApifyStorageClient: Manages storage on the Apify platform. The Apify storage client is implemented in the Apify SDK; you will find more information about it in the Apify SDK documentation.
Each storage client is responsible for maintaining the storages in a specific environment. This abstraction makes it easier to switch between different environments, e.g. between local development and a cloud production setup.
Storage clients provide a unified interface for interacting with Dataset, KeyValueStore, and RequestQueue, regardless of the underlying storage implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup.
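For example, the storage code itself does not change when you switch backends. Here is a minimal sketch that opens a key-value store with an explicitly provided client; it assumes KeyValueStore.open accepts a storage_client argument, analogous to the Dataset example later in this guide:

import asyncio

from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import KeyValueStore


async def main() -> None:
    # Swap MemoryStorageClient for FileSystemStorageClient and the rest of
    # the code stays the same; only the backend changes.
    kvs = await KeyValueStore.open(storage_client=MemoryStorageClient())

    await kvs.set_value('greeting', {'hello': 'world'})
    print(await kvs.get_value('greeting'))


asyncio.run(main())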
Built-in storage clients
Crawlee for Python currently provides two main storage client implementations:
Memory storage client
The MemoryStorageClient stores all data in memory using Python data structures. It provides fast access but does not persist data between runs, meaning all data is lost when the program terminates.
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient
# Create a new instance of storage client.
storage_client = MemoryStorageClient()
# And pass it to the crawler.
crawler = ParselCrawler(storage_client=storage_client)
The MemoryStorageClient is a good choice for testing, development, short-lived operations where speed is more important than data persistence, or HTTP APIs where each request should be handled with fresh storage. It is not suitable for production use or long-running crawls, as all data will be lost when the program exits.
The MemoryStorageClient does not persist data between runs. All data is lost when the program terminates.
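Besides passing it to a crawler, you can also work with memory-backed storages directly. A minimal sketch using the Dataset.open, push_data, and iterate_items calls referenced elsewhere in this guide:

import asyncio

from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    # Everything lives in memory and disappears when the process exits.
    storage_client = MemoryStorageClient()

    dataset = await Dataset.open(storage_client=storage_client)
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})

    # The items are available only for the lifetime of this process.
    async for item in dataset.iterate_items():
        print(item)


asyncio.run(main())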
File system storage client
The FileSystemStorageClient provides persistent storage by writing data directly to the file system. It uses smart caching and batch processing for better performance while storing data in human-readable JSON format. This is the default storage client used by Crawlee when no other storage client is specified.
The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.
This storage client is ideal for large datasets and for long-running operations where data persistence is required. Data can be easily inspected and shared with other tools.
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient
# Create a new instance of storage client.
storage_client = FileSystemStorageClient()
# And pass it to the crawler.
crawler = ParselCrawler(storage_client=storage_client)
Configuration options for the FileSystemStorageClient can be set through environment variables or the Configuration class:
- storage_dir (env: CRAWLEE_STORAGE_DIR, default: './storage'): The root directory for all storage data.
- purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True): Whether to purge the default storages on start.
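The same options can also be supplied through the environment instead of code. A minimal sketch, assuming the variables are set before Crawlee creates its Configuration (for example at the top of your entry point script):

import os

# Both variables are read when Crawlee builds its Configuration, so they
# must be set before any crawler or storage is created (assumption above).
os.environ['CRAWLEE_STORAGE_DIR'] = './my_storage'
os.environ['CRAWLEE_PURGE_ON_START'] = 'false'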
Data are stored using the following directory structure:
{CRAWLEE_STORAGE_DIR}/
├── datasets/
│   └── {DATASET_NAME}/
│       ├── __metadata__.json
│       ├── 000000001.json
│       └── 000000002.json
├── key_value_stores/
│   └── {KVS_NAME}/
│       ├── __metadata__.json
│       ├── key1.json
│       ├── key2.txt
│       └── key3.json
└── request_queues/
    └── {RQ_NAME}/
        ├── __metadata__.json
        ├── {REQUEST_ID_1}.json
        └── {REQUEST_ID_2}.json
Where:
- {CRAWLEE_STORAGE_DIR}: The root directory for local storage.
- {DATASET_NAME}, {KVS_NAME}, {RQ_NAME}: The unique names for each storage instance (defaults to "default").
- Each storage instance keeps a single __metadata__.json file; the individual data records are stored directly as files alongside it, without additional per-record metadata files, which keeps the structure simple.
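Because the data is stored as plain JSON files on disk, it can be inspected without Crawlee at all. A small sketch that reads the items of the default dataset, assuming the default ./storage directory layout shown above:

import json
from pathlib import Path

# The path follows the directory layout above; adjust it if you changed
# CRAWLEE_STORAGE_DIR or the dataset name.
dataset_dir = Path('./storage/datasets/default')

for item_file in sorted(dataset_dir.glob('*.json')):
    if item_file.name == '__metadata__.json':
        continue  # Skip the storage metadata file.
    print(json.loads(item_file.read_text()))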
Here is an example of how to configure the FileSystemStorageClient:
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient
# Create a new instance of storage client.
storage_client = FileSystemStorageClient()
# Create a configuration with custom settings.
configuration = Configuration(
    storage_dir='./my_storage',
    purge_on_start=False,
)

# And pass them to the crawler.
crawler = ParselCrawler(
    storage_client=storage_client,
    configuration=configuration,
)
Creating a custom storage client
A storage client consists of two parts: the storage client factory and the individual storage type clients. The StorageClient acts as a factory that creates the specific clients (DatasetClient, KeyValueStoreClient, RequestQueueClient) where the actual storage logic is implemented.
Here is an example of a custom storage client that implements the StorageClient interface:
from __future__ import annotations
from typing import TYPE_CHECKING
from crawlee.storage_clients import StorageClient
from crawlee.storage_clients._base import (
    DatasetClient,
    KeyValueStoreClient,
    RequestQueueClient,
)

if TYPE_CHECKING:
    from crawlee.configuration import Configuration


# Implement the storage type clients with your backend logic.
class CustomDatasetClient(DatasetClient):
    # Implement methods like push_data, get_data, iterate_items, etc.
    pass


class CustomKeyValueStoreClient(KeyValueStoreClient):
    # Implement methods like get_value, set_value, delete, etc.
    pass


class CustomRequestQueueClient(RequestQueueClient):
    # Implement methods like add_request, fetch_next_request, etc.
    pass


# Implement the storage client factory.
class CustomStorageClient(StorageClient):
    async def create_dataset_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomDatasetClient:
        # Create and return your custom dataset client.
        pass

    async def create_kvs_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomKeyValueStoreClient:
        # Create and return your custom key-value store client.
        pass

    async def create_rq_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomRequestQueueClient:
        # Create and return your custom request queue client.
        pass
Custom storage clients can implement any storage logic, such as connecting to a database, using a cloud storage service, or integrating with other systems. They must implement the required methods for creating, reading, updating, and deleting data in the respective storages.
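For instance, a custom dataset client might simply collect pushed items in an in-memory list. The following is only an illustrative sketch: the exact set of abstract methods and their signatures is defined by DatasetClient, so check the base class before relying on the names and parameters assumed here.

from __future__ import annotations

from typing import Any

from crawlee.storage_clients._base import DatasetClient


class ListBackedDatasetClient(DatasetClient):
    """Illustrative sketch only: keeps pushed items in a plain Python list."""

    def __init__(self) -> None:
        self._items: list[dict[str, Any]] = []

    # Assumed signature: push_data accepts a single item or a list of items.
    async def push_data(self, data: list[Any] | dict[str, Any]) -> None:
        if isinstance(data, list):
            self._items.extend(data)
        else:
            self._items.append(data)

    # The remaining abstract methods (get_data, iterate_items, etc.) would be
    # implemented analogously against the same list.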
Registering storage clients
Storage clients can be registered either:
- Globally, with the ServiceLocator or by passing them directly to the crawler;
- Per storage, when opening a specific storage instance like Dataset, KeyValueStore, or RequestQueue.
from crawlee.crawlers import ParselCrawler
from crawlee.service_locator import service_locator
from crawlee.storages import Dataset

# Create an instance of the custom storage client implemented above.
storage_client = CustomStorageClient()

# Register it globally with the service locator.
service_locator.set_storage_client(storage_client)

# Or pass it directly to the crawler.
crawler = ParselCrawler(storage_client=storage_client)

# Or provide it only when opening a storage (e.g. a dataset).
dataset = await Dataset.open(
    name='my_dataset',
    storage_client=storage_client,
)
You can also register a different storage client for each storage instance, allowing you to use different backends for different storages. This is useful when you want to, for example, use a fast in-memory storage for the RequestQueue while persisting scraping results in a Dataset or KeyValueStore.
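A minimal sketch of mixing backends this way, assuming RequestQueue.open accepts a storage_client argument in the same way as Dataset.open shown above:

from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset, RequestQueue

# Keep requests in fast, non-persistent memory...
request_queue = await RequestQueue.open(
    storage_client=MemoryStorageClient(),
)

# ...while persisting scraped results to the file system.
dataset = await Dataset.open(
    storage_client=FileSystemStorageClient(),
)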
Conclusion
Storage clients in Crawlee provide different backends for storages. Use MemoryStorageClient for testing and fast operations without persistence, or FileSystemStorageClient for environments where data needs to persist. You can also create custom storage clients for specialized backends by implementing the StorageClient interface. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!