Dataset

Represents an append-only structured storage, ideal for tabular data similar to database tables.

The Dataset class is designed to store structured data, where each entry (row) maintains consistent attributes (columns) across the dataset. It operates in an append-only mode, allowing new records to be added, but not modified or deleted. This makes it particularly useful for storing results from web crawling operations.

Data can be stored either locally or in the cloud, depending on how the underlying storage client is set up. By default a MemoryStorageClient is used, but a different storage client can be supplied.

By default, data is stored using the following path structure:

{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
  • {CRAWLEE_STORAGE_DIR}: The root directory for all storage data, specified by the CRAWLEE_STORAGE_DIR environment variable.
  • {DATASET_ID}: Specifies the dataset, either "default" or a custom dataset ID.
  • {INDEX}: Represents the zero-based index of the record within the dataset.
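The path structure above can be sketched as a small helper. This is only an illustration of the template, not the library's own code: the function name `record_path` and the './storage' fallback are assumptions, and the exact file-name formatting used on disk (for example zero padding of the index) is not specified here.

```python
import os
from pathlib import Path
from typing import Optional


def record_path(dataset_id: str, index: int, storage_dir: Optional[str] = None) -> Path:
    """Build {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json."""
    # Assumption: fall back to './storage' when the environment variable is unset.
    root = storage_dir or os.environ.get('CRAWLEE_STORAGE_DIR', './storage')
    return Path(root) / 'datasets' / dataset_id / f'{index}.json'
```

For example, `record_path('default', 0, storage_dir='/tmp/storage')` points at the first record of the default dataset.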

To open a dataset, use the open class method by specifying an id, name, or configuration. If none are provided, the default dataset for the current crawler run is used. Attempting to open a dataset by id that does not exist will raise an error; however, if accessed by name, the dataset will be created if it doesn't already exist.

Usage

from crawlee.storages import Dataset

dataset = await Dataset.open(name='my_dataset')

Methods

__init__

  • __init__(*, id, name, storage_client): None
  • Parameters

    • id: str (optional, keyword-only)
    • name: str | None (optional, keyword-only)
    • storage_client: BaseStorageClient (optional, keyword-only)

    Returns None

check_and_serialize

  • async check_and_serialize(*, item, index): str
  • Serializes a given item to JSON, checks its serializability and size against a limit.


    Parameters

    • item: JsonSerializable (optional, keyword-only)

      The item to serialize.

    • index: int | None = None (optional, keyword-only)

      Index of the item, used for error context.

    Returns str
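The check described above can be approximated with the standard library. This is a minimal sketch under stated assumptions, not the method's actual implementation: the name `serialize_with_limit`, the error messages, and reading "9MB" as 9 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
import json
from typing import Any, Optional

# Assumption: the 9 MB limit is interpreted as 9 * 1024 * 1024 bytes.
MAX_ITEM_BYTES = 9 * 1024 * 1024


def serialize_with_limit(item: Any, index: Optional[int] = None, max_bytes: int = MAX_ITEM_BYTES) -> str:
    """Serialize `item` to JSON; reject it if it is not serializable or too large."""
    where = f' at index {index}' if index is not None else ''
    try:
        payload = json.dumps(item)
    except (TypeError, ValueError) as exc:
        raise ValueError(f'Item{where} is not JSON serializable.') from exc
    # Measure the encoded size, since the limit applies to the JSON representation.
    size = len(payload.encode('utf-8'))
    if size > max_bytes:
        raise ValueError(f'Item{where} is too large: {size} bytes (limit {max_bytes}).')
    return payload
```

The `index` argument only enriches the error message, which helps locate the offending item when a whole list is being pushed.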

drop

  • async drop(): None
  • Drop the storage, removing it from the underlying storage client and clearing the cache.


    Returns None

export_to

  • async export_to(*, kwargs): None
  • Exports the entire dataset into a specified file stored under a key in a key-value store.

    This method consolidates all entries from a specified dataset into one file, which is then saved under a given key in a key-value store. The format of the exported file is determined by the content_type parameter. Either the dataset's ID or name should be specified, and similarly, either the target key-value store's ID or name should be used.


    Parameters

    • key: Required[str] (keyword-only)

      The key under which to save the data.

    • content_type: Literal[json, csv] (keyword-only)

      The format in which to export the data. Either 'json' or 'csv'.

    • to_key_value_store_id: str (keyword-only)

      ID of the key-value store to save the exported file.

    • to_key_value_store_name: str (keyword-only)

      Name of the key-value store to save the exported file.

    Returns None
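A call using the parameters listed above might look like the following sketch. The helper name `export_as_csv` is hypothetical, and `dataset` is assumed to be an object already returned by `Dataset.open`; only the keyword arguments come from the parameter list.

```python
from typing import Any


async def export_as_csv(dataset: Any, key: str, store_name: str) -> None:
    """Export every item of `dataset` into one CSV file in a named key-value store."""
    await dataset.export_to(
        key=key,
        content_type='csv',
        to_key_value_store_name=store_name,
    )
```

Inside a coroutine this would be used as `await export_as_csv(dataset, key='results', store_name='exports')`. Passing `to_key_value_store_id` instead of the name should work the same way for a store identified by ID.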

get_data

  • Retrieves dataset items based on filtering, sorting, and pagination parameters.

    This method allows customization of the data retrieval process from a dataset, supporting operations such as field selection, ordering, and skipping specific records based on provided parameters.


    Parameters

    • offset: int (keyword-only)

      Skips the specified number of items at the start.

    • limit: int (keyword-only)

      The maximum number of items to retrieve. Unlimited if None.

    • clean: bool (keyword-only)

      Returns only non-empty items and excludes hidden fields. Shortcut for skip_hidden and skip_empty.

    • desc: bool (keyword-only)

      Set to True to sort results in descending order.

    • fields: list[str] (keyword-only)

      Fields to include in each item. If provided, fields are returned in the order specified.

    • omit: list[str] (keyword-only)

      Fields to exclude from each item.

    • unwind: str (keyword-only)

      Unwinds items by a specified array field, turning each element into a separate item.

    • skip_empty: bool (keyword-only)

      Excludes empty items from the results if True.

    • skip_hidden: bool (keyword-only)

      Excludes fields starting with '#' if True.

    • flatten: list[str] (keyword-only)

      Fields to be flattened in returned items.

    • view: str (keyword-only)

      Specifies the dataset view to be used.

    Returns DatasetItemsListPage
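The offset and limit parameters lend themselves to page-by-page retrieval. The sketch below is an assumption-laden illustration: the helper name `fetch_all` is hypothetical, and it assumes the returned page object exposes its records as an `items` list, which is not stated in this reference.

```python
from typing import Any


async def fetch_all(dataset: Any, page_size: int = 100) -> list:
    """Collect all items by requesting consecutive pages of `page_size` items."""
    collected = []
    offset = 0
    while True:
        page = await dataset.get_data(offset=offset, limit=page_size)
        # Assumption: the page exposes its records as `page.items`.
        if not page.items:
            break
        collected.extend(page.items)
        offset += len(page.items)
    return collected
```

For large datasets, iterating lazily is usually preferable to collecting everything in memory, so this pattern fits best when the total size is known to be modest.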

get_info

  • Get an object containing general information about the dataset.


    Returns DatasetMetadata | None

iterate_items

  • async iterate_items(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden): AsyncIterator[dict]
  • Iterates over dataset items, applying filtering, sorting, and pagination.

    Retrieves dataset items incrementally, allowing fine-grained control over the data fetched. The function supports various parameters to filter, sort, and limit the data returned, facilitating tailored dataset queries.


    Parameters

    • offset: int = 0 (optional, keyword-only)

      Initial number of items to skip.

    • limit: int | None = None (optional, keyword-only)

      Max number of items to return. No limit if None.

    • clean: bool = False (optional, keyword-only)

      Filters out empty items and hidden fields if True.

    • desc: bool = False (optional, keyword-only)

      Returns items in reverse order if True.

    • fields: list[str] | None = None (optional, keyword-only)

      Specific fields to include in each item.

    • omit: list[str] | None = None (optional, keyword-only)

      Fields to omit from each item.

    • unwind: str | None = None (optional, keyword-only)

      Field name to unwind items by.

    • skip_empty: bool = False (optional, keyword-only)

      Omits empty items if True.

    • skip_hidden: bool = False (optional, keyword-only)

      Excludes fields starting with '#' if True.

    Returns AsyncIterator[dict]
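The effect of skip_empty and skip_hidden (and of clean, which is described as enabling both) can be modelled on plain dictionaries. This is a standalone model rather than the library's code; the function name `clean_items` is hypothetical, and stripping hidden fields before testing for emptiness is an assumption about the order of operations.

```python
from typing import Iterable, Iterator


def clean_items(items: Iterable[dict], skip_empty: bool = True, skip_hidden: bool = True) -> Iterator[dict]:
    """Model of the documented flags on a sequence of plain dicts."""
    for item in items:
        if skip_hidden:
            # Hidden fields are those whose name starts with '#'.
            item = {key: value for key, value in item.items() if not key.startswith('#')}
        # Assumption: emptiness is checked after hidden fields are removed.
        if skip_empty and not item:
            continue
        yield item
```

Under this model an item that contains only hidden fields disappears entirely when both flags are set.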

open

  • async open(*, id, name, configuration, storage_client): BaseStorage
  • Open a storage, either restore existing or create a new one.


    Parameters

    • id: str | None = None (optional, keyword-only)

      The storage ID.

    • name: str | None = None (optional, keyword-only)

      The storage name.

    • configuration: Configuration | None = None (optional, keyword-only)

      Configuration object used during the storage creation or restoration process.

    • storage_client: BaseStorageClient | None = None (optional, keyword-only)

      Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.

    Returns BaseStorage

push_data

  • async push_data(*, data, kwargs): None
  • Store an object or an array of objects to the dataset.

    The size of the data is limited by the receiving API and therefore push_data() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.


    Parameters

    • data: JsonSerializable (optional, keyword-only)

      A JSON serializable data structure to be stored in the dataset. The JSON representation of each item must be smaller than 9MB.

    Returns None
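Pushing a list and reading it back could be sketched as follows. The helper name `store_and_list` is hypothetical, `dataset` is assumed to be an opened Dataset, and passing the data as the first positional argument is an assumption about the call style.

```python
from typing import Any


async def store_and_list(dataset: Any, records: list) -> list:
    """Push `records` in one call, then read every stored item back."""
    # Each record is subject to the per-item size limit; the list itself is not.
    await dataset.push_data(records)
    return [item async for item in dataset.iterate_items()]
```

Because the dataset is append-only, calling the helper twice with the same records would be expected to store them twice rather than overwrite anything.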

write_to_csv

  • async write_to_csv(*, destination, kwargs): None
  • Exports the entire dataset into an arbitrary stream.


    Parameters

    • destination: TextIO (optional, keyword-only)

      The stream into which the dataset contents should be written.

    • dialect: str (keyword-only)

      Specifies a dialect to be used in CSV parsing and writing.

    • delimiter: str (keyword-only)

      A one-character string used to separate fields. Defaults to ','.

    • doublequote: bool (keyword-only)

      Controls how instances of quotechar inside a field should be quoted. When True, the character is doubled; when False, the escapechar is used as a prefix. Defaults to True.

    • escapechar: str (keyword-only)

      A one-character string used to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. Defaults to None, disabling escaping.

    • lineterminator: str (keyword-only)

      The string used to terminate lines produced by the writer. Defaults to '\r\n'.

    • quotechar: str (keyword-only)

      A one-character string used to quote fields containing special characters, like the delimiter or quotechar, or fields containing new-line characters. Defaults to '"'.

    • quoting: int (keyword-only)

      Controls when quotes should be generated by the writer and recognized by the reader. Can take any of the QUOTE_* constants, with a default of QUOTE_MINIMAL.

    • skipinitialspace: bool (keyword-only)

      When True, spaces immediately following the delimiter are ignored. Defaults to False.

    • strict: bool (keyword-only)

      When True, raises an exception on bad CSV input. Defaults to False.

    Returns None
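The formatting parameters above carry the same names and defaults as those of Python's csv module, so their effect can be tried out on plain rows without a dataset. The sketch below uses only the standard library; the function name `rows_to_csv` is hypothetical, and it does not claim to reproduce how write_to_csv itself lays out columns.

```python
import csv
import io
from typing import Any


def rows_to_csv(rows: list, **csv_kwargs: Any) -> str:
    """Write a header plus one line per row, forwarding keyword arguments to csv.writer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, **csv_kwargs)
    if rows:
        # Assumption: every row has the same keys as the first one.
        writer.writerow(rows[0].keys())
        for row in rows:
            writer.writerow(row.values())
    return buffer.getvalue()
```

With `delimiter=';'` and the default QUOTE_MINIMAL quoting, only fields that contain the delimiter, the quote character, or a line break end up quoted. With an opened dataset, the equivalent call would presumably be `await dataset.write_to_csv(destination=buffer, delimiter=';')`.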

write_to_json

  • async write_to_json(*, destination, kwargs): None
  • Exports the entire dataset into an arbitrary stream.


    Parameters

    • destination: TextIO (optional, keyword-only)

      The stream into which the dataset contents should be written.

    • skipkeys: bool (keyword-only)

      If True (default: False), dict keys that are not of a basic type (str, int, float, bool, None) will be skipped instead of raising a TypeError.

    • ensure_ascii: bool (keyword-only)

      Determines if non-ASCII characters should be escaped in the output JSON string.

    • check_circular: bool (keyword-only)

      If False (default: True), skips the circular reference check for container types. A circular reference will result in a RecursionError or worse if unchecked.

    • allow_nan: bool (keyword-only)

      If False (default: True), raises a ValueError for out-of-range float values (nan, inf, -inf) to strictly comply with the JSON specification. If True, uses their JavaScript equivalents (NaN, Infinity, -Infinity).

    • cls: type[json.JSONEncoder] (keyword-only)

      Allows specifying a custom JSON encoder.

    • indent: int (keyword-only)

      Specifies the number of spaces for indentation in the pretty-printed JSON output.

    • separators: tuple[str, str] (keyword-only)

      A tuple of (item_separator, key_separator). The default is (', ', ': ') if indent is None and (',', ': ') otherwise.

    • default: Callable (keyword-only)

      A function called for objects that can't be serialized otherwise. It should return a JSON-encodable version of the object or raise a TypeError.

    • sort_keys: bool (keyword-only)

      Specifies whether the output JSON object should have keys sorted alphabetically.

    Returns None
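These parameters match the keyword arguments of Python's json.dump, so their effect can be explored with the standard library alone. The function name `items_to_json` is hypothetical, and the sketch makes no claim about how write_to_json gathers the items before serializing them.

```python
import io
import json
from typing import Any


def items_to_json(items: list, **json_kwargs: Any) -> str:
    """Serialize `items` into an in-memory text stream using json.dump."""
    buffer = io.StringIO()
    json.dump(items, buffer, **json_kwargs)
    return buffer.getvalue()
```

For instance, `sort_keys=True` makes the output stable across runs, and `ensure_ascii=False` keeps non-ASCII characters readable instead of escaping them. With an opened dataset, the corresponding call would presumably be `await dataset.write_to_json(destination=buffer, sort_keys=True)`.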

Properties

id

id: str

Get the storage ID.

name

name: str | None

Get the storage name.