Dataset

crawlee.storages.dataset.Dataset

Represents an append-only structured storage, ideal for tabular data akin to database tables.

Represents a structured data store similar to a table, where each object (row) has consistent attributes (columns). Datasets operate on an append-only basis, allowing for the addition of new records without the modification or removal of existing ones. This class is typically used for storing crawling results.

Data can be stored locally or in the cloud, with local storage paths formatted as: {CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json. Here, {DATASET_ID} is either "default" or a specific dataset ID, and {INDEX} represents the zero-based index of the item in the dataset.

To open a dataset, use the open class method, providing an id, name, or configuration. If none is specified, the default dataset for the current crawler run is used. Opening a dataset by id raises an error if it does not exist, whereas opening one by name creates it if necessary.

Usage: dataset = await Dataset.open(id='my_dataset_id')
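
A fuller sketch of the typical flow, assuming Dataset is importable from crawlee.storages as in current Crawlee releases; the dataset name and record fields are illustrative:

    import asyncio

    from crawlee.storages import Dataset


    async def main() -> None:
        # Open a named dataset; it is created on first use.
        dataset = await Dataset.open(name='my-dataset')

        # Append a record. Existing records are never modified or removed.
        await dataset.push_data({'url': 'https://crawlee.dev', 'status': 200})


    asyncio.run(main())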

Index

Constructors

__init__

  • __init__(id, name, configuration, client): None
  • Parameters

    • id: str
    • name: str | None
    • configuration: Configuration
    • client: BaseStorageClient

    Returns None

Methods

drop

  • async drop(): None
  • Returns None
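
drop removes the dataset and its data from the underlying storage. A minimal sketch, run inside an async function; the dataset name is illustrative:

    dataset = await Dataset.open(name='temporary-results')
    await dataset.drop()  # The dataset and all of its items are deleted.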

export_to

  • async export_to(**kwargs): None
  • Exports the entire dataset into a specified file stored under a key in a key-value store.

    This method consolidates all entries from the dataset into a single file, which is then saved under the given key in a key-value store. The format of the exported file is determined by the content_type parameter. The target key-value store may be specified by either its ID or its name.


    Parameters

    • kwargs: Unpack[ExportToKwargs]

    Returns None
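
A sketch of a typical call, run inside an async function with dataset opened as above; it assumes ExportToKwargs accepts the key and content_type fields shown here, and the key name is illustrative:

    # Consolidate every item into a single CSV file saved under the
    # key 'results.csv' in the default key-value store.
    await dataset.export_to(key='results.csv', content_type='csv')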

get_data

  • async get_data(**kwargs): DatasetItemsListPage
  • Retrieves dataset items based on filtering, sorting, and pagination parameters.

    This method allows customization of the data retrieval process from a dataset, supporting operations such as field selection, ordering, and skipping specific records based on provided parameters.


    Parameters

    • kwargs: Unpack[GetDataKwargs]

    Returns DatasetItemsListPage
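
A sketch, assuming GetDataKwargs accepts the same filtering fields as iterate_items below, and that DatasetItemsListPage exposes items, count, and total attributes:

    # Fetch up to 50 items in reverse order, keeping only two fields.
    page = await dataset.get_data(limit=50, desc=True, fields=['url', 'status'])
    print(f'Got {page.count} of {page.total} items')
    for item in page.items:
        print(item)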

get_info

  • async get_info(): DatasetMetadata | None
  • Get an object containing general information about the dataset.


    Returns DatasetMetadata | None
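
A sketch; the metadata attributes referenced here (id, name, item_count) are assumptions about the shape of DatasetMetadata:

    metadata = await dataset.get_info()
    if metadata is not None:
        print(metadata.id, metadata.name, metadata.item_count)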

iterate_items

  • async iterate_items(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden): AsyncIterator[dict]
  • Iterates over dataset items, applying filtering, sorting, and pagination.

    Retrieves dataset items incrementally, allowing fine-grained control over the data fetched. The function supports various parameters to filter, sort, and limit the data returned, facilitating tailored dataset queries.


    Parameters

    • offset: int = 0 (keyword-only)
    • limit: int | None = None (keyword-only)
    • clean: bool = False (keyword-only)
    • desc: bool = False (keyword-only)
    • fields: list[str] | None = None (keyword-only)
    • omit: list[str] | None = None (keyword-only)
    • unwind: str | None = None (keyword-only)
    • skip_empty: bool = False (keyword-only)
    • skip_hidden: bool = False (keyword-only)

    Returns AsyncIterator[dict]
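
A sketch using the parameters listed above; unlike get_data, items are streamed lazily instead of being materialized in one page. Run inside an async function with dataset opened as above:

    # Skip the first 10 records, stop after 100, and drop empty items.
    async for item in dataset.iterate_items(offset=10, limit=100, skip_empty=True):
        print(item['url'])  # Assumes records were stored with a 'url' field.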

open

  • async open(*, id, name, configuration): Dataset
  • Parameters

    • id: str | None = None (keyword-only)
    • name: str | None = None (keyword-only)
    • configuration: Configuration | None = None (keyword-only)

    Returns Dataset
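
A sketch of the three ways to open a dataset described earlier; the id and name values are illustrative:

    # Default dataset of the current crawler run.
    default_dataset = await Dataset.open()

    # By name: created if it does not exist yet.
    named_dataset = await Dataset.open(name='my-dataset')

    # By id: raises an error if no such dataset exists.
    existing_dataset = await Dataset.open(id='my_dataset_id')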

push_data

  • async push_data(data, **kwargs): None
  • Store an object or an array of objects to the dataset.

    The size of the data is limited by the receiving API and therefore push_data() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.


    Parameters

    • data: JSONSerializable
    • kwargs: Unpack[PushDataKwargs]

    Returns None
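
A sketch showing both accepted shapes, run inside an async function with dataset opened as above; the record fields are illustrative:

    # A single object (one row)...
    await dataset.push_data({'url': 'https://example.com', 'status': 200})

    # ...or a list of objects. Each element must stay under the 9MB JSON
    # limit individually, but the list itself may be of any length.
    await dataset.push_data([
        {'url': 'https://example.com/a', 'status': 200},
        {'url': 'https://example.com/b', 'status': 404},
    ])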

write_to

  • async write_to(content_type, destination): None
  • Exports the entire dataset into an arbitrary stream.


    Parameters

    • content_type: Literal['json', 'csv']
    • destination: TextIO

    Returns None
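
A sketch writing the dataset to an in-memory text stream; any TextIO destination, such as an open file, works the same way. Run inside an async function with dataset opened as above:

    import io

    buffer = io.StringIO()
    await dataset.write_to('csv', buffer)
    print(buffer.getvalue())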

Properties

id

id: str

name

name: str | None