2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
@@ -90,6 +90,8 @@
title: Create a document dataset
- local: nifti_dataset
title: Create a medical imaging dataset
- local: zarr_dataset
title: Create a Zarr dataset
title: "Vision"
- sections:
- local: nlp_load
33 changes: 33 additions & 0 deletions docs/source/about_dataset_features.mdx
@@ -198,3 +198,36 @@ Another example with tool calling data and the `on_mixed_types="use_json"` argument
>>> ds[0][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
```

## Zarr feature

Use [`Zarr`] for path-based lazy access to Zarr and OME-Zarr stores.

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["/path/to/sample.zarr"]},
... features=Features({"scan": Zarr()}),
... )
>>> proxy = ds[0]["scan"] # lazy
>>> proxy.shape
(1937, 2048, 2048)
```

A `Zarr` value resolves lazily to a proxy (`ZarrArrayProxy`, `ZarrGroupProxy`, or `OmeZarrProxy`) depending on store content.
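The resolution can be pictured as a small dispatch on store metadata. This is only a conceptual sketch; the metadata keys used here are illustrative assumptions, not the actual implementation:

```py
# Conceptual sketch of proxy resolution: dispatch on what the store contains.
# The names mirror the proxies mentioned above; the keys ("attributes",
# "multiscales", "node_type") are assumptions for illustration only.

def resolve_proxy_kind(store_metadata: dict) -> str:
    """Pick a proxy kind from (simplified) store metadata."""
    if "multiscales" in store_metadata.get("attributes", {}):
        return "OmeZarrProxy"    # OME-Zarr: multiscale image metadata present
    if store_metadata.get("node_type") == "array":
        return "ZarrArrayProxy"  # a single Zarr array
    return "ZarrGroupProxy"      # anything else: a plain Zarr group
```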

For Hub paths, use `hf://datasets/...` and cast as usual:

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["hf://datasets/username/my-zarr-data@main"]},
... features=Features({"scan": Zarr()}),
... )
>>> ds[0]["scan"].shape
(1937, 2048, 2048)
```

For OME-Zarr stores, multiscale helpers like `num_levels`, `get_level`, and `thumbnail` are available on the resolved proxy.
17 changes: 17 additions & 0 deletions docs/source/loading.mdx
@@ -246,6 +246,23 @@ If you have remote files likely stored as a `csv`, `json`, `txt`, `parquet` or a
- `https://` URLs for public online files, e.g. `data_files=["https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"]`
- `hf://` URLs for files in any [Dataset repository](https://huggingface.co/docs/hub/datasets-overview) or [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets) on Hugging Face, e.g. `data_files=["hf://datasets/karpathy/tinystories-gpt4-clean/tinystories_gpt4_clean.parquet"]` or `data_files=["hf://buckets/julien-c/my-training-bucket/julien/affluence.csv"]`

### Zarr stores on the Hub (index + cast)

For raw Zarr data on the Hub, the recommended approach is to load a lightweight index dataset and cast a path column to [`Zarr`].

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>>
>>> sample = next(iter(stream_ds))
>>> sample["scan"].shape
(1937, 2048, 2048)
```

This pattern keeps the index and metadata in a small tabular dataset while the heavy Zarr data lives in a separate data repository.
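As a sketch, the index can be as simple as a JSONL file written with the standard library; the repo names and store paths below are placeholders:

```py
# Write a tiny JSONL index whose "scan" column points at Zarr stores on the
# Hub. One row per sample; heavy data stays in the separate data repository.
import json

records = [
    {
        "sample_id": f"sample_{i:03d}",
        "scan": f"hf://datasets/username/my-zarr-data@main/sample_{i:03d}.zarr",
    }
    for i in range(3)
]

with open("index.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The resulting file can be pushed to an index repository and loaded with `load_dataset`.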

## Multiprocessing

When a dataset is made of several files (that we call "shards"), it is possible to significantly speed up the dataset downloading and preparation step.
4 changes: 4 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -289,6 +289,10 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Nifti

### Zarr

[[autodoc]] datasets.Zarr

## Filesystems

[[autodoc]] datasets.filesystems.is_remote_filesystem
10 changes: 9 additions & 1 deletion docs/source/package_reference/utilities.mdx
@@ -55,4 +55,12 @@ environment variable. You can also enable/disable them using [`~utils.enable_pro

[[autodoc]] datasets.utils.disable_progress_bars

[[autodoc]] datasets.utils.are_progress_bars_disabled
[[autodoc]] datasets.utils.are_progress_bars_disabled

## Zarr utilities

[[autodoc]] datasets.utils.zarr_utils.load_zarr_dataset

[[autodoc]] datasets.utils.zarr_utils.push_to_hub_zarr

[[autodoc]] datasets.utils.zarr_utils.ZarrCollator
20 changes: 20 additions & 0 deletions docs/source/stream.mdx
@@ -63,6 +63,26 @@ This special type of dataset has its own set of processing methods shown below.
> You shouldn't use an [`IterableDataset`] for jobs that require random access to examples, because you can only iterate over it with a for loop. Getting the last example in an iterable dataset requires iterating over all the previous examples.
> You can find more details in the [Dataset vs. IterableDataset guide](./about_mapstyle_vs_iterable).

## Streaming Zarr data

Zarr-backed samples can be streamed lazily from `hf://` paths when used with a [`Zarr`] feature column.

For best performance on large stores:

- start with small regions of interest (ROIs),
- use lower-resolution levels first for OME-Zarr,
- avoid full exhaustive reads in exploratory loops.

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>> sample = next(iter(stream_ds))
>>> lowres = sample["scan"].get_level(-1)
>>> patch = lowres[0, :64, :64]
```
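The ROI advice can be pictured with a plain NumPy array standing in for a lazy Zarr array: the slicing syntax is identical, but on a real store each read only fetches the chunks it overlaps. The shapes here are deliberately small:

```py
import numpy as np

# NumPy stands in for a lazy Zarr array. On a chunked remote store, the
# small ROI read below would touch only a few chunks, while reading the
# whole array would download everything.
scan = np.arange(16 * 256 * 256, dtype=np.uint32).reshape(16, 256, 256)

roi = scan[0, :64, :64]  # small ROI: cheap on a chunked store
print(roi.shape)         # (64, 64)
```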


## Column indexing

109 changes: 109 additions & 0 deletions docs/source/zarr_dataset.mdx
@@ -0,0 +1,109 @@
# Create a Zarr dataset

This guide shows how to create and share datasets backed by Zarr stores, including OME-Zarr multiscale images.

## Recommended workflow

For Hub usage, the recommended pattern is:

1. Store raw Zarr data in a data repository.
2. Create a lightweight index repository (Parquet or JSONL) with one row per sample and a `scan` path column.
3. Load the index with [`load_dataset`] and cast the path column to [`Zarr`].

This gives you the standard `load_dataset(..., streaming=True)` API and lazy access to Zarr content.

## Local loading (fast)

Use [`datasets.utils.zarr_utils.load_zarr_dataset`] to discover `.zarr` directories locally without enumerating all internal chunk files.

```py
>>> from datasets.utils.zarr_utils import load_zarr_dataset
>>>
>>> ds = load_zarr_dataset("/path/to/local/zarr_folder")
>>> ds[0]["zarr"].shape
(1937, 2048, 2048)
```

You can also build a dataset directly from paths:

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["/path/to/sample.zarr"]},
... features=Features({"scan": Zarr()}),
... )
>>> ds[0]["scan"].shape
(1937, 2048, 2048)
```

## Upload a Zarr store to the Hub

Use [`datasets.utils.zarr_utils.push_to_hub_zarr`] to upload a local store. If the file count exceeds the configured limit, the store is rechunked and sharded to Zarr v3 before upload.

```py
>>> from datasets.utils.zarr_utils import push_to_hub_zarr
>>>
>>> push_to_hub_zarr(
... local_path="/path/to/sample.zarr",
... repo_id="username/my-zarr-data",
... file_limit=10_000,
... )
```
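The `file_limit` check can be sketched as a simple file count over the store directory. This is a rough stand-in, assuming a directory-backed store, not the library's actual logic:

```py
from pathlib import Path

def count_store_files(store_path: str) -> int:
    """Count the files inside a local Zarr store directory (sketch)."""
    return sum(1 for p in Path(store_path).rglob("*") if p.is_file())

# A store over the limit would be rechunked and sharded before upload:
# if count_store_files("/path/to/sample.zarr") > 10_000: ...
```

Zarr v2 stores keep one file per chunk, so large arrays with small chunks can easily exceed Hub-friendly file counts; sharding to Zarr v3 packs many chunks into fewer files.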

## Create an index (manifest) dataset

Create a small dataset with one row per sample and a `scan` path column:

```py
>>> from datasets import Dataset, DatasetDict
>>>
>>> record = {
... "sample_id": "sample_001",
... "scan": "hf://datasets/username/my-zarr-data@main",
... }
>>> index_ds = DatasetDict({"train": Dataset.from_list([record])})
>>> index_ds.push_to_hub("username/my-zarr-index", private=True)
```

If your Zarr store is nested in the repo, use the full path, for example:
`hf://datasets/username/my-zarr-data@main/sample_001.zarr`.

## Stream with the standard API (index + cast)

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>>
>>> sample = next(iter(stream_ds))
>>> proxy = sample["scan"]
>>> proxy.shape, proxy.num_levels
((1937, 2048, 2048), 6)
```

## OME-Zarr helpers

For OME-Zarr stores, [`Zarr`] resolves to an object with multiscale helpers such as:

- `num_levels`
- `get_level(level)`
- `thumbnail(level=-1)`
- `iter_patches(...)`
- `random_patch(...)`

```py
>>> proxy = sample["scan"]
>>> proxy.num_levels
6
>>> arr_l0 = proxy.get_level(0)
>>> patch = proxy.random_patch((1, 128, 128), level=0)
```
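Conceptually, patch iteration tiles an array into fixed-size windows. The sketch below uses a plain NumPy array and a hypothetical `iter_patches_sketch` helper; the real `iter_patches` works on the lazy proxy and only reads the chunks each patch touches:

```py
import numpy as np

def iter_patches_sketch(arr: np.ndarray, patch_hw: tuple):
    """Yield non-overlapping (C, ph, pw) patches from a (C, H, W) array."""
    ph, pw = patch_hw
    _, H, W = arr.shape
    for y in range(0, H - ph + 1, ph):
        for x in range(0, W - pw + 1, pw):
            yield arr[:, y:y + ph, x:x + pw]

arr = np.zeros((1, 256, 256))
patches = list(iter_patches_sketch(arr, (128, 128)))
print(len(patches))  # 4
```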

## Notes

- `load_dataset("zarrfolder", data_dir=...)` may be slow on very large local stores with many internal chunk files.
- For production-scale workflows, prefer:
- local: [`datasets.utils.zarr_utils.load_zarr_dataset`] or `Dataset.from_dict(...).cast_column(..., Zarr())`
- Hub: index + cast pattern shown above.
1 change: 1 addition & 0 deletions src/datasets/config.py
@@ -140,6 +140,7 @@
TORCHVISION_AVAILABLE = importlib.util.find_spec("torchvision") is not None
PDFPLUMBER_AVAILABLE = importlib.util.find_spec("pdfplumber") is not None
NIBABEL_AVAILABLE = importlib.util.find_spec("nibabel") is not None
ZARR_AVAILABLE = importlib.util.find_spec("zarr") is not None

# Optional compression tools
RARFILE_AVAILABLE = importlib.util.find_spec("rarfile") is not None
2 changes: 2 additions & 0 deletions src/datasets/features/__init__.py
@@ -17,6 +17,7 @@
"Video",
"Pdf",
"Nifti",
"Zarr",
]
from .audio import Audio
from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, Json, LargeList, List, Sequence, Value
@@ -25,3 +26,4 @@
from .pdf import Pdf
from .translation import Translation, TranslationVariableLanguages
from .video import Video
from .zarr import Zarr
2 changes: 2 additions & 0 deletions src/datasets/features/features.py
@@ -47,6 +47,7 @@
from .pdf import Pdf, encode_pdfplumber_pdf
from .translation import Translation, TranslationVariableLanguages
from .video import Video
from .zarr import Zarr


logger = logging.get_logger(__name__)
@@ -1525,6 +1526,7 @@ def decode_nested_example(schema, obj, token_per_repo_id: Optional[dict[str, Uni
Video.__name__: Video,
Pdf.__name__: Pdf,
Nifti.__name__: Nifti,
Zarr.__name__: Zarr,
Json.__name__: Json,
}
