FileSystemDatasetClient
Hierarchy
- DatasetClient
- FileSystemDatasetClient
Methods
__init__
Initialize a new instance.
Preferably use the FileSystemDatasetClient.open class method to create a new instance.
Parameters
keyword-only metadata: DatasetMetadata
keyword-only storage_dir: Path
keyword-only lock: asyncio.Lock
Returns None
drop
Drop the whole dataset and remove all its items.
The backend method for the Dataset.drop call.
Returns None
get_data
Get data from the dataset with various filtering options.
The backend method for the Dataset.get_data call.
Parameters
optional keyword-only offset: int = 0
optional keyword-only limit: int | None = 999_999_999_999
optional keyword-only clean: bool = False
optional keyword-only desc: bool = False
optional keyword-only fields: list[str] | None = None
optional keyword-only omit: list[str] | None = None
optional keyword-only unwind: str | None = None
optional keyword-only skip_empty: bool = False
optional keyword-only skip_hidden: bool = False
optional keyword-only flatten: list[str] | None = None
optional keyword-only view: str | None = None
Returns DatasetItemsListPage
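The paging arguments can be pictured as operations on a sorted list of item files. The following is a minimal sketch, not the library's implementation: `read_page`, `demo_read_page`, and the zero-padded file names are hypothetical, and it assumes that `desc` reverses the order before `offset` and `limit` are applied and that `skip_empty` drops empty items afterwards.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def read_page(
    dataset_dir: Path,
    *,
    offset: int = 0,
    limit: int | None = None,
    desc: bool = False,
    skip_empty: bool = False,
) -> list[dict[str, Any]]:
    """Hypothetical helper: read one page of items from per-item JSON files."""
    # Sort by file name so the order matches insertion order
    # (assumes zero-padded sequential names such as 000000001.json).
    files = sorted(dataset_dir.glob("*.json"))
    if desc:
        files.reverse()
    # Slice the file list first, so only the requested files are opened.
    end = None if limit is None else offset + limit
    items = [json.loads(f.read_text(encoding="utf-8")) for f in files[offset:end]]
    if skip_empty:
        # Assumption: empty items are filtered after paging, not before.
        items = [item for item in items if item]
    return items


def demo_read_page() -> list[dict[str, Any]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i, item in enumerate([{"n": 1}, {}, {"n": 3}, {"n": 4}], start=1):
            (root / f"{i:09d}.json").write_text(json.dumps(item), encoding="utf-8")
        return read_page(root, offset=1, limit=2, desc=True, skip_empty=True)
```

Under these assumptions a page can contain fewer than `limit` items, because empty items are removed after the slice is taken.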
get_metadata
Get the metadata of the dataset.
Returns DatasetMetadata
iterate_items
Iterate over the dataset items with filtering options.
The backend method for the Dataset.iterate_items call.
Parameters
optional keyword-only offset: int = 0
optional keyword-only limit: int | None = None
optional keyword-only clean: bool = False
optional keyword-only desc: bool = False
optional keyword-only fields: list[str] | None = None
optional keyword-only omit: list[str] | None = None
optional keyword-only unwind: str | None = None
optional keyword-only skip_empty: bool = False
optional keyword-only skip_hidden: bool = False
Returns AsyncIterator[dict[str, Any]]
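Because the return type is an AsyncIterator, items can be consumed one at a time instead of being loaded as a whole page. A rough sketch of that idea with the standard library only follows; `iterate_json_items` and `collect_demo` are hypothetical names, and the per-item file layout is an assumption rather than the client's confirmed behavior.

```python
from __future__ import annotations

import asyncio
import json
import tempfile
from collections.abc import AsyncIterator
from pathlib import Path
from typing import Any


async def iterate_json_items(
    dataset_dir: Path,
    *,
    offset: int = 0,
    limit: int | None = None,
    skip_empty: bool = False,
) -> AsyncIterator[dict[str, Any]]:
    """Hypothetical async generator yielding one item per JSON file."""
    files = sorted(dataset_dir.glob("*.json"))
    end = None if limit is None else offset + limit
    for path in files[offset:end]:
        # Read in a worker thread so file I/O does not block the event loop.
        text = await asyncio.to_thread(path.read_text, encoding="utf-8")
        item = json.loads(text)
        if skip_empty and not item:
            continue
        # Only the current item is held in memory at this point.
        yield item


async def collect_demo() -> list[dict[str, Any]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i, item in enumerate([{"url": "a"}, {}, {"url": "c"}], start=1):
            (root / f"{i:09d}.json").write_text(json.dumps(item), encoding="utf-8")
        return [item async for item in iterate_json_items(root, skip_empty=True)]
```

A caller would typically write `async for item in ...` and process each item as it arrives; the list comprehension in `collect_demo` is only there to make the result easy to inspect.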
open
Open or create a file system dataset client.
This method attempts to open an existing dataset from the file system. If a dataset with the specified ID or name exists, it loads the metadata from the stored files. If no existing dataset is found, a new one is created.
Parameters
keyword-only id: str | None
The ID of the dataset to open. If provided, searches for an existing dataset by ID.
keyword-only name: str | None
The name of the dataset to open. If not provided, uses the default dataset.
keyword-only configuration: Configuration
The configuration object containing storage directory settings.
Returns FileSystemDatasetClient
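The open-or-create behavior described above can be illustrated with a small standalone sketch. Everything here is hypothetical: `open_or_create`, the `metadata.json` file name, the `default` directory name, and the metadata fields are assumptions for illustration, and lookup by ID is left out for brevity.

```python
from __future__ import annotations

import json
import tempfile
import uuid
from pathlib import Path
from typing import Any

METADATA_FILENAME = "metadata.json"  # hypothetical file name


def open_or_create(storage_dir: Path, *, name: str | None = None) -> dict[str, Any]:
    """Hypothetical helper: load a dataset's metadata, or create it if missing."""
    # Assumption: an unnamed dataset lives in a directory called "default".
    dataset_dir = storage_dir / (name or "default")
    metadata_path = dataset_dir / METADATA_FILENAME
    if metadata_path.exists():
        # Existing dataset: reuse the stored metadata.
        return json.loads(metadata_path.read_text(encoding="utf-8"))
    # No dataset yet: create the directory and a fresh metadata file.
    dataset_dir.mkdir(parents=True, exist_ok=True)
    metadata = {"id": uuid.uuid4().hex, "name": name, "item_count": 0}
    metadata_path.write_text(json.dumps(metadata), encoding="utf-8")
    return metadata


def demo_open() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        first = open_or_create(Path(tmp), name="products")
        second = open_or_create(Path(tmp), name="products")
        # The second call finds the files written by the first one.
        return first == second
```

Calling the helper twice with the same name returns the same metadata, which mirrors the documented behavior of loading an existing dataset instead of creating a duplicate.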
purge
Purge all items from the dataset.
The backend method for the Dataset.purge call.
Returns None
push_data
Push data to the dataset.
The backend method for the Dataset.push_data call.
Parameters
data: list[Any] | dict[str, Any]
Returns None
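Since `data` may be either a single dict or a list, a writer has to normalize the input before storing each item as its own file. The sketch below shows one way this could look, assuming sequential zero-padded file names and an `asyncio.Lock` guarding the counter; `JsonFileWriter` and `demo_push` are hypothetical and not part of the library.

```python
from __future__ import annotations

import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any


class JsonFileWriter:
    """Hypothetical writer storing each pushed item as its own JSON file."""

    def __init__(self, dataset_dir: Path) -> None:
        self._dir = dataset_dir
        self._lock = asyncio.Lock()
        self._count = 0

    async def push_data(self, data: list[Any] | dict[str, Any]) -> None:
        # Normalize the input: a single dict is treated as a one-item list.
        items = data if isinstance(data, list) else [data]
        # The lock keeps concurrent pushes from reusing a sequence number.
        async with self._lock:
            for item in items:
                self._count += 1
                path = self._dir / f"{self._count:09d}.json"
                await asyncio.to_thread(
                    path.write_text, json.dumps(item), encoding="utf-8"
                )


async def demo_push() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        writer = JsonFileWriter(Path(tmp))
        # Two concurrent pushes: one dict and one list of two dicts.
        await asyncio.gather(
            writer.push_data({"url": "a"}),
            writer.push_data([{"url": "b"}, {"url": "c"}]),
        )
        return sorted(p.name for p in Path(tmp).iterdir())
```

Holding the lock for the whole batch is a deliberate simplification: it keeps the items of one push contiguous at the cost of serializing writers.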
Properties
path_to_dataset
The full path to the dataset directory.
path_to_metadata
The full path to the dataset metadata file.
File system implementation of the dataset client.
This client persists dataset items to the file system as individual JSON files within a structured directory hierarchy following the pattern:
Each item is stored as a separate file, so data already written survives process termination and can be recovered afterwards. Dataset operations such as filtering, sorting, and pagination are implemented by processing the stored files according to the requested parameters.
This implementation is ideal for long-running crawlers where data persistence is important, and for development environments where you want to easily inspect the collected data between runs.