FileSystemDatasetClient

File system implementation of the dataset client.

This client persists dataset items to the file system as individual JSON files within a structured directory hierarchy following the pattern:

{STORAGE_DIR}/datasets/{DATASET_ID}/{ITEM_ID}.json

Each item is stored as a separate file, so collected data is durable and can be recovered after the process terminates. Dataset operations such as filtering, sorting, and pagination are implemented by reading the stored files and processing them according to the requested parameters.
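The directory pattern above can be illustrated with the standard library alone. This is a minimal sketch, not Crawlee's implementation: the helper names `write_item` and `read_items` and the zero-padded item IDs are assumptions made for the example.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def write_item(storage_dir: Path, dataset_id: str, item_id: str, item: dict[str, Any]) -> Path:
    """Write one item to {storage_dir}/datasets/{dataset_id}/{item_id}.json."""
    dataset_dir = storage_dir / "datasets" / dataset_id
    dataset_dir.mkdir(parents=True, exist_ok=True)
    item_path = dataset_dir / f"{item_id}.json"
    item_path.write_text(json.dumps(item, ensure_ascii=False), encoding="utf-8")
    return item_path


def read_items(storage_dir: Path, dataset_id: str) -> list[dict[str, Any]]:
    """Read every item file back, ordered by file name."""
    dataset_dir = storage_dir / "datasets" / dataset_id
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(dataset_dir.glob("*.json"))
    ]
```

Because every item lives in its own file, a crash between two writes leaves all previously written items intact, and the directory can be inspected with ordinary file tools.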

This implementation is ideal for long-running crawlers where data persistence is important, and for development environments where you want to easily inspect the collected data between runs.

Methods

__init__

  • __init__(*, metadata, storage_dir, lock): None
  • Initialize a new instance.

    Preferably use the FileSystemDatasetClient.open class method to create a new instance.


    Parameters

    • metadata: DatasetMetadata (keyword-only)
    • storage_dir: Path (keyword-only)
    • lock: asyncio.Lock (keyword-only)

    Returns None

drop

  • async drop(): None
  • Drop the whole dataset and remove all its items.

    The backend method for the Dataset.drop call.


    Returns None

get_data

  • async get_data(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden, flatten, view): DatasetItemsListPage
  • Get data from the dataset with various filtering options.

    The backend method for the Dataset.get_data call.


    Parameters

    • offset: int = 0 (optional, keyword-only)
    • limit: int | None = 999_999_999_999 (optional, keyword-only)
    • clean: bool = False (optional, keyword-only)
    • desc: bool = False (optional, keyword-only)
    • fields: list[str] | None = None (optional, keyword-only)
    • omit: list[str] | None = None (optional, keyword-only)
    • unwind: str | None = None (optional, keyword-only)
    • skip_empty: bool = False (optional, keyword-only)
    • skip_hidden: bool = False (optional, keyword-only)
    • flatten: list[str] | None = None (optional, keyword-only)
    • view: str | None = None (optional, keyword-only)

    Returns DatasetItemsListPage
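To make a few of these options concrete, here is a rough sketch of how `offset`, `limit`, `desc`, `fields`, `omit`, and `skip_empty` could be applied to an in-memory list. The semantics and the order of operations are assumptions inferred from the parameter names, not a description of Crawlee's behavior, and `select_items` is a hypothetical helper that returns a plain list rather than a `DatasetItemsListPage`.

```python
from __future__ import annotations

from typing import Any


def select_items(
    items: list[dict[str, Any]],
    *,
    offset: int = 0,
    limit: int | None = None,
    desc: bool = False,
    fields: list[str] | None = None,
    omit: list[str] | None = None,
    skip_empty: bool = False,
) -> list[dict[str, Any]]:
    """Apply ordering, filtering, pagination, and field selection (assumed order)."""
    ordered = list(reversed(items)) if desc else list(items)
    if skip_empty:
        # Assumption: an "empty" item is an empty dict.
        ordered = [item for item in ordered if item]
    end = None if limit is None else offset + limit
    page = ordered[offset:end]

    result: list[dict[str, Any]] = []
    for item in page:
        if fields is not None:
            # Keep only the requested keys that are present.
            item = {key: item[key] for key in fields if key in item}
        if omit is not None:
            # Drop the excluded keys.
            item = {key: value for key, value in item.items() if key not in omit}
        result.append(item)
    return result
```

For example, `select_items(items, desc=True, limit=10, fields=["url"])` would return the `url` field of the ten most recently stored items under these assumptions.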

get_metadata

iterate_items

  • async iterate_items(*, offset, limit, clean, desc, fields, omit, unwind, skip_empty, skip_hidden): AsyncIterator[dict[str, Any]]
  • Iterate over the dataset items with filtering options.

    The backend method for the Dataset.iterate_items call.


    Parameters

    • offset: int = 0 (optional, keyword-only)
    • limit: int | None = None (optional, keyword-only)
    • clean: bool = False (optional, keyword-only)
    • desc: bool = False (optional, keyword-only)
    • fields: list[str] | None = None (optional, keyword-only)
    • omit: list[str] | None = None (optional, keyword-only)
    • unwind: str | None = None (optional, keyword-only)
    • skip_empty: bool = False (optional, keyword-only)
    • skip_hidden: bool = False (optional, keyword-only)

    Returns AsyncIterator[dict[str, Any]]
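An async iterator lets a caller process items one at a time instead of loading the whole dataset into memory. The sketch below shows that idea over a directory of JSON files using only the standard library. The function names `iterate_json_files` and `collect` are hypothetical, only `offset`, `limit`, and `desc` are modeled, and none of this is taken from Crawlee's source.

```python
from __future__ import annotations

import asyncio
import json
from pathlib import Path
from typing import Any, AsyncIterator


async def iterate_json_files(
    dataset_dir: Path,
    *,
    offset: int = 0,
    limit: int | None = None,
    desc: bool = False,
) -> AsyncIterator[dict[str, Any]]:
    """Yield parsed items one file at a time, ordered by file name."""
    paths = sorted(dataset_dir.glob("*.json"), reverse=desc)
    end = None if limit is None else offset + limit
    for path in paths[offset:end]:
        # Read in a worker thread so disk I/O does not block the event loop.
        text = await asyncio.to_thread(path.read_text, encoding="utf-8")
        yield json.loads(text)


async def collect(dataset_dir: Path, **options: Any) -> list[dict[str, Any]]:
    """Drain the iterator into a list (convenient for small datasets)."""
    return [item async for item in iterate_json_files(dataset_dir, **options)]
```

A consumer would typically write `async for item in iterate_json_files(path): ...` and stop early when it has seen enough, which is where lazy iteration pays off.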

open

  • async open(*, id, name, configuration): FileSystemDatasetClient
  • Open or create a file system dataset client.

    This method attempts to open an existing dataset from the file system. If a dataset with the specified ID or name exists, it loads the metadata from the stored files. If no existing dataset is found, a new one is created.


    Parameters

    • id: str | None (keyword-only)

      The ID of the dataset to open. If provided, searches for an existing dataset by ID.

    • name: str | None (keyword-only)

      The name of the dataset to open. If not provided, uses the default dataset.

    • configuration: Configuration (keyword-only)

      The configuration object containing storage directory settings.

    Returns FileSystemDatasetClient
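The open-or-create behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the metadata file name `metadata.json`, the metadata fields, the `"default"` fallback name, and the helper `open_or_create` are all hypothetical.

```python
from __future__ import annotations

import json
import uuid
from pathlib import Path
from typing import Any

METADATA_FILENAME = "metadata.json"  # hypothetical file name for this sketch


def open_or_create(
    storage_dir: Path,
    *,
    dataset_id: str | None = None,
    name: str | None = None,
) -> dict[str, Any]:
    """Return the metadata of a matching dataset, creating one if none exists."""
    if dataset_id is None and name is None:
        name = "default"  # assumed fallback when neither ID nor name is given

    datasets_dir = storage_dir / "datasets"
    datasets_dir.mkdir(parents=True, exist_ok=True)

    # Look for an existing dataset whose stored metadata matches the ID or name.
    for metadata_path in sorted(datasets_dir.glob(f"*/{METADATA_FILENAME}")):
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
        if dataset_id is not None and metadata["id"] == dataset_id:
            return metadata
        if dataset_id is None and metadata["name"] == name:
            return metadata

    # Nothing matched: create a new dataset directory keyed by its ID.
    new_id = dataset_id or uuid.uuid4().hex
    metadata = {"id": new_id, "name": name, "item_count": 0}
    dataset_dir = datasets_dir / new_id
    dataset_dir.mkdir(parents=True, exist_ok=True)
    (dataset_dir / METADATA_FILENAME).write_text(json.dumps(metadata), encoding="utf-8")
    return metadata
```

Calling the helper twice with the same name returns the same metadata, which is the property that lets a restarted crawler pick up the dataset it was writing to before.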

purge

  • async purge(): None
  • Purge all items from the dataset.

    The backend method for the Dataset.purge call.


    Returns None

push_data

  • async push_data(data): None
  • Push data to the dataset.

    The backend method for the Dataset.push_data call.


    Parameters

    • data: list[Any] | dict[str, Any]

    Returns None
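Since `data` may be either a single item or a list of items, a natural first step is to normalize it to a list and then write one file per item. The sketch below shows that shape, with an `asyncio.Lock` guarding the item counter so concurrent pushes do not reuse a file name. The class `TinyFileDataset`, the counter, and the zero-padded file names are assumptions for illustration only.

```python
from __future__ import annotations

import asyncio
import json
from pathlib import Path
from typing import Any


class TinyFileDataset:
    """Hypothetical stand-in that stores each pushed item as its own JSON file."""

    def __init__(self, dataset_dir: Path) -> None:
        self._dir = dataset_dir
        self._lock = asyncio.Lock()
        self._item_count = 0

    async def push_data(self, data: list[Any] | dict[str, Any]) -> None:
        # Accept a single item or a list of items.
        items = data if isinstance(data, list) else [data]
        async with self._lock:
            self._dir.mkdir(parents=True, exist_ok=True)
            for item in items:
                self._item_count += 1
                path = self._dir / f"{self._item_count:09d}.json"
                # Blocking write kept for brevity; a real client would likely offload it.
                path.write_text(json.dumps(item), encoding="utf-8")
```

Pushing `{"a": 1}` and then `[{"a": 2}, {"a": 3}]` produces three files numbered in push order under these assumptions.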

Properties

path_to_dataset

path_to_dataset: Path

The full path to the dataset directory.

path_to_metadata

path_to_metadata: Path

The full path to the dataset metadata file.