2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
@@ -90,6 +90,8 @@
title: Create a document dataset
- local: nifti_dataset
title: Create a medical imaging dataset
- local: zarr_dataset
title: Create a Zarr dataset
title: "Vision"
- sections:
- local: nlp_load
33 changes: 33 additions & 0 deletions docs/source/about_dataset_features.mdx
@@ -198,3 +198,36 @@ Another example with tool calling data and the `on_mixed_types="use_json"` argument
>>> ds[0][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
```

## Zarr feature

Use [`Zarr`] for path-based lazy access to Zarr and OME-Zarr stores.

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["/path/to/sample.zarr"]},
... features=Features({"scan": Zarr()}),
... )
>>> proxy = ds[0]["scan"] # lazy
>>> proxy.shape
(1937, 2048, 2048)
```

A `Zarr` value resolves lazily to a proxy (`ZarrArrayProxy`, `ZarrGroupProxy`, or `OmeZarrProxy`) depending on store content.
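The resolution can be pictured as a small dispatch on store metadata. This is only a conceptual sketch; the metadata keys used here are illustrative assumptions, not the actual implementation:

```py
# Conceptual sketch of proxy resolution: dispatch on what the store contains.
# The names mirror the proxies mentioned above; the keys ("attributes",
# "multiscales", "node_type") are assumptions for illustration only.

def resolve_proxy_kind(store_metadata: dict) -> str:
    """Pick a proxy kind from (simplified) store metadata."""
    if "multiscales" in store_metadata.get("attributes", {}):
        return "OmeZarrProxy"    # OME-Zarr: multiscale image metadata present
    if store_metadata.get("node_type") == "array":
        return "ZarrArrayProxy"  # a single Zarr array
    return "ZarrGroupProxy"      # anything else: a plain Zarr group
```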

For Hub paths, use `hf://datasets/...` and cast as usual:

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["hf://datasets/username/my-zarr-data@main"]},
... features=Features({"scan": Zarr()}),
... )
>>> ds[0]["scan"].shape
(1937, 2048, 2048)
```

For OME-Zarr stores, multiscale helpers like `num_levels`, `get_level`, and `thumbnail` are available on the resolved proxy.
17 changes: 17 additions & 0 deletions docs/source/loading.mdx
@@ -246,6 +246,23 @@ If you have remote files likely stored as a `csv`, `json`, `txt`, `parquet` or a
- `https://` URLs for public online files, e.g. `data_files=["https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"]`
- `hf://` URLs for files in any [Dataset repository](https://huggingface.co/docs/hub/datasets-overview) or [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets) on Hugging Face, e.g. `data_files=["hf://datasets/karpathy/tinystories-gpt4-clean/tinystories_gpt4_clean.parquet"]` or `data_files=["hf://buckets/julien-c/my-training-bucket/julien/affluence.csv"]`

### Zarr stores on the Hub (index + cast)

For raw Zarr data on the Hub, the recommended approach is to load a lightweight index dataset and cast a path column to [`Zarr`].

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>>
>>> sample = next(iter(stream_ds))
>>> sample["scan"].shape
(1937, 2048, 2048)
```

This pattern keeps the index and metadata in a small tabular dataset while the heavy Zarr data lives in a separate data repository.
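As a sketch, the index can be as simple as a JSONL file written with the standard library; the repo names and store paths below are placeholders:

```py
# Write a tiny JSONL index whose "scan" column points at Zarr stores on the
# Hub. One row per sample; heavy data stays in the separate data repository.
import json

records = [
    {
        "sample_id": f"sample_{i:03d}",
        "scan": f"hf://datasets/username/my-zarr-data@main/sample_{i:03d}.zarr",
    }
    for i in range(3)
]

with open("index.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The resulting file can be pushed to an index repository and loaded with `load_dataset`.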

## Multiprocessing

When a dataset is made of several files (that we call "shards"), it is possible to significantly speed up the dataset downloading and preparation step.
4 changes: 4 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -289,6 +289,10 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Nifti

### Zarr

[[autodoc]] datasets.Zarr

## Filesystems

[[autodoc]] datasets.filesystems.is_remote_filesystem
10 changes: 9 additions & 1 deletion docs/source/package_reference/utilities.mdx
@@ -55,4 +55,12 @@ environment variable. You can also enable/disable them using [`~utils.enable_pro

[[autodoc]] datasets.utils.disable_progress_bars

[[autodoc]] datasets.utils.are_progress_bars_disabled
[[autodoc]] datasets.utils.are_progress_bars_disabled

## Zarr utilities

[[autodoc]] datasets.utils.zarr_utils.load_zarr_dataset

[[autodoc]] datasets.utils.zarr_utils.push_to_hub_zarr

[[autodoc]] datasets.utils.zarr_utils.ZarrCollator
20 changes: 20 additions & 0 deletions docs/source/stream.mdx
@@ -63,6 +63,26 @@ This special type of dataset has its own set of processing methods shown below.
> You shouldn't use an [`IterableDataset`] for jobs that require random access to examples, because you can only iterate over it with a for loop. Getting the last example in an iterable dataset requires iterating over all the previous examples.
> You can find more details in the [Dataset vs. IterableDataset guide](./about_mapstyle_vs_iterable).

## Streaming Zarr data

Zarr-backed samples can be streamed lazily from `hf://` paths when used with a [`Zarr`] feature column.

For best performance on large stores:

- start with small regions of interest (ROIs),
- use lower-resolution levels first for OME-Zarr,
- avoid full exhaustive reads in exploratory loops.

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>> sample = next(iter(stream_ds))
>>> lowres = sample["scan"].get_level(-1)
>>> patch = lowres[0, :64, :64]
```
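The ROI advice can be pictured with a plain NumPy array standing in for a lazy Zarr array: the slicing syntax is identical, but on a real store each read only fetches the chunks it overlaps. The shapes here are deliberately small:

```py
import numpy as np

# NumPy stands in for a lazy Zarr array. On a chunked remote store, the
# small ROI read below would touch only a few chunks, while reading the
# whole array would download everything.
scan = np.arange(16 * 256 * 256, dtype=np.uint32).reshape(16, 256, 256)

roi = scan[0, :64, :64]  # small ROI: cheap on a chunked store
print(roi.shape)         # (64, 64)
```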


## Column indexing

109 changes: 109 additions & 0 deletions docs/source/zarr_dataset.mdx
@@ -0,0 +1,109 @@
# Create a Zarr dataset

This guide shows how to create and share datasets backed by Zarr stores, including OME-Zarr multiscale images.

## Recommended workflow

For Hub usage, the recommended pattern is:

1. Store raw Zarr data in a data repository.
2. Create a lightweight index repository (Parquet or JSONL) with one row per sample and a `scan` path column.
3. Load the index with [`load_dataset`] and cast the path column to [`Zarr`].

This gives you the standard `load_dataset(..., streaming=True)` API and lazy access to Zarr content.

## Local loading (fast)

Use [`datasets.utils.zarr_utils.load_zarr_dataset`] to discover `.zarr` directories locally without enumerating all internal chunk files.

```py
>>> from datasets.utils.zarr_utils import load_zarr_dataset
>>>
>>> ds = load_zarr_dataset("/path/to/local/zarr_folder")
>>> ds[0]["zarr"].shape
(1937, 2048, 2048)
```

You can also build a dataset directly from paths:

```py
>>> from datasets import Dataset, Features, Zarr
>>>
>>> ds = Dataset.from_dict(
... {"scan": ["/path/to/sample.zarr"]},
... features=Features({"scan": Zarr()}),
... )
>>> ds[0]["scan"].shape
(1937, 2048, 2048)
```

## Upload a Zarr store to the Hub

Use [`datasets.utils.zarr_utils.push_to_hub_zarr`] to upload a local store. If the file count exceeds the configured limit, the store is rechunked and sharded to Zarr v3 before upload.

```py
>>> from datasets.utils.zarr_utils import push_to_hub_zarr
>>>
>>> push_to_hub_zarr(
... local_path="/path/to/sample.zarr",
... repo_id="username/my-zarr-data",
... file_limit=10_000,
... )
```
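The `file_limit` check can be sketched as a simple file count over the store directory. This is a rough stand-in, assuming a directory-backed store, not the library's actual logic:

```py
from pathlib import Path

def count_store_files(store_path: str) -> int:
    """Count the files inside a local Zarr store directory (sketch)."""
    return sum(1 for p in Path(store_path).rglob("*") if p.is_file())

# A store over the limit would be rechunked and sharded before upload:
# if count_store_files("/path/to/sample.zarr") > 10_000: ...
```

Zarr v2 stores keep one file per chunk, so large arrays with small chunks can easily exceed Hub-friendly file counts; sharding to Zarr v3 packs many chunks into fewer files.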

## Create an index (manifest) dataset

Create a small dataset with one row per sample and a `scan` path column:

```py
>>> from datasets import Dataset, DatasetDict
>>>
>>> record = {
... "sample_id": "sample_001",
... "scan": "hf://datasets/username/my-zarr-data@main",
... }
>>> index_ds = DatasetDict({"train": Dataset.from_list([record])})
>>> index_ds.push_to_hub("username/my-zarr-index", private=True)
```

If your Zarr store is nested in the repo, use the full path, for example:
`hf://datasets/username/my-zarr-data@main/sample_001.zarr`.

## Stream with the standard API (index + cast)

```py
>>> from datasets import Zarr, load_dataset
>>>
>>> stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
>>> stream_ds = stream_ds.cast_column("scan", Zarr())
>>>
>>> sample = next(iter(stream_ds))
>>> proxy = sample["scan"]
>>> proxy.shape, proxy.num_levels
((1937, 2048, 2048), 6)
```

## OME-Zarr helpers

For OME-Zarr stores, [`Zarr`] resolves to an object with multiscale helpers such as:

- `num_levels`
- `get_level(level)`
- `thumbnail(level=-1)`
- `iter_patches(...)`
- `random_patch(...)`

```py
>>> proxy = sample["scan"]
>>> proxy.num_levels
6
>>> arr_l0 = proxy.get_level(0)
>>> patch = proxy.random_patch((1, 128, 128), level=0)
```
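Conceptually, patch iteration tiles an array into fixed-size windows. The sketch below uses a plain NumPy array and a hypothetical `iter_patches_sketch` helper; the real `iter_patches` works on the lazy proxy and only reads the chunks each patch touches:

```py
import numpy as np

def iter_patches_sketch(arr: np.ndarray, patch_hw: tuple):
    """Yield non-overlapping (C, ph, pw) patches from a (C, H, W) array."""
    ph, pw = patch_hw
    _, H, W = arr.shape
    for y in range(0, H - ph + 1, ph):
        for x in range(0, W - pw + 1, pw):
            yield arr[:, y:y + ph, x:x + pw]

arr = np.zeros((1, 256, 256))
patches = list(iter_patches_sketch(arr, (128, 128)))
print(len(patches))  # 4
```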

## Notes

- `load_dataset("zarrfolder", data_dir=...)` may be slow on very large local stores with many internal chunk files.
- For production-scale workflows, prefer:
- local: [`datasets.utils.zarr_utils.load_zarr_dataset`] or `Dataset.from_dict(...).cast_column(..., Zarr())`
- Hub: index + cast pattern shown above.
1 change: 1 addition & 0 deletions src/datasets/config.py
@@ -140,6 +140,7 @@
TORCHVISION_AVAILABLE = importlib.util.find_spec("torchvision") is not None
PDFPLUMBER_AVAILABLE = importlib.util.find_spec("pdfplumber") is not None
NIBABEL_AVAILABLE = importlib.util.find_spec("nibabel") is not None
ZARR_AVAILABLE = importlib.util.find_spec("zarr") is not None

# Optional compression tools
RARFILE_AVAILABLE = importlib.util.find_spec("rarfile") is not None
2 changes: 2 additions & 0 deletions src/datasets/features/__init__.py
@@ -17,6 +17,7 @@
"Video",
"Pdf",
"Nifti",
"Zarr",
]
from .audio import Audio
from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, Json, LargeList, List, Sequence, Value
@@ -25,3 +26,4 @@
from .pdf import Pdf
from .translation import Translation, TranslationVariableLanguages
from .video import Video
from .zarr import Zarr
2 changes: 2 additions & 0 deletions src/datasets/features/features.py
@@ -47,6 +47,7 @@
from .pdf import Pdf, encode_pdfplumber_pdf
from .translation import Translation, TranslationVariableLanguages
from .video import Video
from .zarr import Zarr


logger = logging.get_logger(__name__)
@@ -1525,6 +1526,7 @@ def decode_nested_example(schema, obj, token_per_repo_id: Optional[dict[str, Uni
Video.__name__: Video,
Pdf.__name__: Pdf,
Nifti.__name__: Nifti,
Zarr.__name__: Zarr,
Json.__name__: Json,
}
