Skip to main content
Version: 3.7

Result Storage

Crawlee has several result storage types that are useful for specific tasks. The data is stored on a local disk to the directory defined by the CRAWLEE_STORAGE_DIR environment variable. If this variable is not defined, by default Crawlee sets CRAWLEE_STORAGE_DIR to ./storage in the current working directory.

Crawlee storage is managed by MemoryStorage class. During the crawler run all information is stored in memory, while also being off-loaded to the local files in respective storage type folders.

Key-value store

The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs or to persist the state of crawlers.

Each Crawlee project run is associated with a default key-value store. By convention, the project input and output are stored in the default key-value store under the INPUT and OUTPUT keys respectively. Typically, both input and output are JSON files, although they could be any other format.

In Crawlee, the key-value store is represented by the KeyValueStore class. In order to simplify access to the default key-value store, Crawlee also provides KeyValueStore.getValue() and KeyValueStore.setValue() functions.

The data is stored in the directory specified by the CRAWLEE_STORAGE_DIR environment variable as follows:

{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
note

{STORE_ID} is the name or the ID of the key-value store. The default key-value store has ID default, unless we override it by setting the CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID environment variable. The {KEY} is the key of the record and {EXT} corresponds to the MIME content type of the data value.

The following code demonstrates basic operations of key-value stores:

import { KeyValueStore } from 'crawlee';

// Get the INPUT from the default key-value store
const input = await KeyValueStore.getInput();

// Write the OUTPUT to the default key-value store
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write a record to the named key-value store.
// JavaScript object is automatically converted to JSON,
// strings and binary buffers are stored as they are
await store.setValue('some-key', { foo: 'bar' });

// Read a record from the named key-value store.
// Note that JSON is automatically parsed to a JavaScript object,
// text data is returned as a string, and other data is returned as binary buffer
const value = await store.getValue('some-key');

// Delete a record from the named key-value store
await store.setValue('some-key', null);

To see a real-world example of how to get the input from the key-value store, see the Screenshots example.

Dataset

Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. Dataset can be imagined as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - we can only add new records to it, but we cannot modify or remove existing records.

Each Crawlee project run is associated with a default dataset. Typically, it is used to store crawling results specific for the crawler run. Its usage is optional.

In Crawlee, the dataset is represented by the Dataset class. In order to simplify writes to the default dataset, Crawlee also provides the Dataset.pushData() function.

The data is stored in the directory specified by the CRAWLEE_STORAGE_DIR environment variable as follows:

{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
note

{DATASET_ID} is the name or the ID of the dataset. The default dataset has ID default, unless we override it by setting the CRAWLEE_DEFAULT_DATASET_ID environment variable. Each dataset item is stored as a separate JSON file, where {INDEX} is a zero-based index of the item in the dataset.

The following code demonstrates basic operations of the dataset:

import { Dataset } from 'crawlee';

// Write a single row to the default dataset
await Dataset.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Dataset.open('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);

To see how to use the dataset to store crawler results, see the Cheerio Crawler example.

Cleaning up the storages

Default storages are purged before the crawler starts if not specified otherwise. This happens as early as when we try to open some storage (e.g. via Dataset.open()) or when we try to work with a default storage via one of the helper methods (e.g. Dataset.pushData() that under the hood calls Dataset.open()). If we don't work with storages explicitly in our code, the purging will eventually happen when the run method of our crawler is executed. In case we need to purge the storages sooner, we can use the purgeDefaultStorages() helper explicitly:

import { purgeDefaultStorages } from 'crawlee';

await purgeDefaultStorages();

Calling this function will clean up the default results storage directories except the INPUT key in default key-value store directory. This is a shortcut for running (optional) purge method on the StorageClient interface, in other words it will call the purge method of the underlying storage implementation we are currently using. In addition, this method will make sure the storage is purged only once for a given execution context, so it is safe to call it multiple times.