
Add Zarr / OME-Zarr Dataset Support #8135

Open
Elsword016 wants to merge 3 commits into huggingface:main from Elsword016:feat/zarr-support

Conversation

@Elsword016

Add Zarr / OME-Zarr Dataset Support

TL;DR

This PR adds support for Zarr-backed datasets to datasets, with a focus on lazy, streaming-friendly access to large multidimensional arrays. It introduces a new Zarr feature, lazy proxy objects for arrays, groups, and OME-Zarr multiscale images, a zarrfolder packaged module, Hub-compatible hf:// opening, upload helpers for large Zarr stores, and patch-based utilities for training workflows. I originally started this for my lab's microscopy and connectomics data, but it has reached a point where a PR makes sense, both so everyone can use it and to gather improvements and feedback. I know it's quite large, and I'm happy to get in touch and respond to any queries.

The core design is intentionally lightweight: dataset rows store only a path to a Zarr store, and array data is opened lazily only when the user accesses slices, levels, patches, or metadata.

Motivation

Zarr is a common storage format for large scientific data workloads such as microscopy volumes, geospatial arrays, medical images, and connectomics data, because it stores arrays as independently readable chunks inside a hierarchical store. OME-Zarr builds on Zarr for bioimaging by standardizing multiscale image pyramids, axes, coordinate transforms, channel metadata, and labels.

Before this PR, users could not represent these datasets directly in datasets without writing custom loading code outside the library. This PR brings that workflow into the datasets abstraction instead of requiring every scientific ML project to rebuild it.

Related Issues

This PR addresses the core request from #4096, which asked for streaming support for hosted Zarr stores. That issue highlighted two problems that this PR now handles:

  • Zarr stores are designed for cloud/fsspec-style access, but datasets did not expose a native streaming path for them.
  • Zarr stores can contain many small chunk files, which makes naive Git/Hub upload workflows difficult.

Since that issue was opened, the Hub and datasets stack has gained stronger hf:// and fsspec integration. This PR builds on that foundation by opening Hub-backed Zarr paths through FsspecStore.from_url(...) and by preserving token-aware access for private dataset repositories.

This PR also addresses the Zarr part of #7863, which requests native support for cloud-native formats such as Lance, Vortex, Iceberg, and Zarr, along with finer-grained streaming control. This PR is intentionally scoped to Zarr. The recommended index/manifest pattern gives users row-level control over which stores are streamed while keeping the heavy array data separate from lightweight dataset metadata.

What This PR Adds

1. Zarr feature

Adds a new Zarr feature type that stores only a path in Arrow:

{"path": "hf://datasets/username/my-zarr-data@main/sample_001.zarr"}

When decoded, the feature returns a lazy ZarrProxy. The proxy does not load array data during dataset construction, iteration, or row decoding. It opens the underlying Zarr store only on first access, such as shape, dtype, __getitem__, get_level(...), thumbnail(...), or patch extraction.

The feature supports inputs such as:

  • local path strings;
  • pathlib.Path values;
  • dictionaries with a path key;
  • Zarr arrays and groups when a path can be recovered from the store.

It also supports decode=False, matching the pattern used by other modality-aware features when users want raw storage values.
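The input normalization can be sketched as follows. This is a minimal illustration, not the PR's actual code: encode_zarr_input is a hypothetical helper, and the Zarr array/group case (which recovers a path from the store) is omitted.

```python
from pathlib import Path

def encode_zarr_input(value):
    """Normalize supported inputs to the stored {"path": ...} form (sketch)."""
    if isinstance(value, Path):
        return {"path": str(value)}
    if isinstance(value, str):
        return {"path": value}
    if isinstance(value, dict) and "path" in value:
        return {"path": str(value["path"])}
    raise TypeError(f"Unsupported Zarr input: {type(value).__name__}")

print(encode_zarr_input(Path("/data/sample.zarr")))
# → {'path': '/data/sample.zarr'}
```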

2. Lazy proxy hierarchy

The decoded ZarrProxy automatically resolves to the appropriate concrete proxy:

  • ZarrArrayProxy for plain Zarr arrays;
  • ZarrGroupProxy for non-OME Zarr groups;
  • OmeZarrProxy for OME-Zarr multiscale image groups.

This avoids forcing users to know up front whether a path points to an array, group, or OME-Zarr hierarchy. The proxy keeps dataset examples lightweight and pickle-safe, which is important for multiprocessing data loaders and distributed training.

For plain arrays, users can access shape, dtype, chunks, attributes, slices, random patches, and strided patch iteration. For groups, users can inspect keys and access nested arrays lazily. For OME-Zarr stores, users get multiscale image helpers directly on the decoded object.
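The open-on-first-access behavior shared by the proxies can be illustrated with a generic lazy-handle pattern. This is a stdlib sketch of the idea only: LazyHandle is hypothetical, and the real proxies add Zarr-specific resolution and a richer API on top.

```python
class LazyHandle:
    """Stores only a path; opens the underlying object on first access."""

    def __init__(self, path, opener):
        self._path = path
        self._opener = opener
        self._obj = None  # nothing is opened at construction time

    def _resolved(self):
        if self._obj is None:
            self._obj = self._opener(self._path)  # first access opens the store
        return self._obj

    @property
    def shape(self):
        return self._resolved().shape

    def __getitem__(self, key):
        return self._resolved()[key]

    def __reduce__(self):
        # Pickle only the path and opener, never the open handle, so the
        # object stays cheap to ship to multiprocessing workers.
        return (type(self), (self._path, self._opener))
```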

3. OME-Zarr multiscale support

OME-Zarr stores are detected from multiscale metadata. The implementation supports both major metadata layouts:

  • OME-NGFF v0.4 metadata stored at the top level of Zarr attributes;
  • OME-NGFF v0.5 metadata stored under the ome namespace.

OmeZarrProxy exposes:

  • num_levels
  • levels
  • get_level(level)
  • thumbnail(level=-1)
  • axes
  • scale
  • channel_names
  • iter_patches(...)
  • random_patch(...)

Level 0 is the highest-resolution level, and negative indices are supported for lower-resolution levels, such as level=-1 for thumbnails or quick inspection.
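Level resolution follows plain Python sequence indexing; a minimal sketch (resolve_level is a hypothetical helper, not an API from this PR):

```python
def resolve_level(level, num_levels):
    """Map a possibly negative multiscale level to a concrete index (sketch)."""
    if level < 0:
        level += num_levels  # -1 -> lowest-resolution level
    if not 0 <= level < num_levels:
        raise IndexError(f"level out of range for {num_levels} levels")
    return level

# With 5 pyramid levels: level 0 is full resolution, level -1 is the smallest.
print(resolve_level(0, 5), resolve_level(-1, 5))  # → 0 4
```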

4. Hub-compatible streaming through hf://

Zarr paths can point to Hub-backed locations such as:

hf://datasets/username/my-zarr-data@main/sample_001.zarr

The feature opens hf:// paths through Zarr's fsspec storage support, so reads are delegated to the Zarr library and the Hub filesystem instead of being implemented as a custom downloader in datasets. This keeps the behavior aligned with chunked Zarr semantics: only the metadata and chunks required by the requested operation are fetched.

Private dataset repositories are supported through the existing token_per_repo_id decode path. The implementation extracts the repository id from hf://datasets/... URLs and passes the matching token through storage options.
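The repo-id extraction can be sketched as plain string handling. This is illustrative only: repo_id_from_hf_url is a hypothetical helper, and the PR's actual parsing may differ.

```python
def repo_id_from_hf_url(url):
    """Extract "namespace/name" from an hf://datasets/... URL (sketch)."""
    prefix = "hf://datasets/"
    if not url.startswith(prefix):
        raise ValueError(f"not an hf:// dataset URL: {url}")
    parts = url[len(prefix):].split("/")
    namespace, name = parts[0], parts[1]
    # A revision may be pinned with "@", e.g. "my-zarr-data@main".
    return f"{namespace}/{name.split('@', 1)[0]}"

print(repo_id_from_hf_url("hf://datasets/username/my-zarr-data@main/sample_001.zarr"))
# → username/my-zarr-data
```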

5. Dataset creation from Zarr folders

Adds a zarrfolder packaged module that discovers .zarr stores and creates one dataset row per store. It supports:

  • local .zarr directory discovery;
  • remote path discovery from file lists;
  • optional label inference from parent folder names;
  • optional metadata integration from metadata.csv, metadata.jsonl, or metadata.parquet.

For local workflows, this PR also adds:

from datasets.utils.zarr_utils import load_zarr_dataset

load_zarr_dataset(...) is the fast local path. It discovers .zarr directories directly and stops descending into each store, avoiding a full scan of every internal chunk file. This is important for large OME-Zarr stores where generic file discovery can be dominated by chunk enumeration.
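The pruning idea can be sketched with os.walk. This is a minimal illustration of the discovery strategy, not the PR's implementation:

```python
import os

def find_zarr_stores(root):
    """Find .zarr stores without descending into their chunk files (sketch)."""
    stores = []
    for dirpath, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            if name.endswith(".zarr"):
                stores.append(os.path.join(dirpath, name))
                dirnames.remove(name)  # prune: never enumerate internal chunks
    return sorted(stores)
```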

6. Upload and storage helpers for large stores

Adds:

from datasets.utils.zarr_utils import push_to_hub_zarr

This helper addresses a practical Hub problem raised in #4096: real Zarr stores often contain many chunk files. In a conventional Zarr layout, each logical chunk may be stored as a separate file. For large microscopy or volumetric arrays, that file count can grow very quickly.

For example, an array with shape (1937, 2048, 2048) and chunks (1, 256, 256) has roughly:

ceil(1937 / 1) * ceil(2048 / 256) * ceil(2048 / 256)
= 1937 * 8 * 8
= 123,968 chunk files

This is difficult for Hub/Git-style storage even when the total byte size is reasonable, because repository operations, upload, listing, and file management all become dominated by many small files.
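The chunk-count arithmetic above generalizes to any shape/chunk pair:

```python
import math

def chunk_file_count(shape, chunks):
    """Files in a one-file-per-chunk layout: product of per-axis ceil(s / c)."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(chunk_file_count((1937, 2048, 2048), (1, 256, 256)))  # → 123968
```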

push_to_hub_zarr(...) provides a practical upload path for this case:

  • counts files before upload;
  • uploads small stores directly;
  • optionally rechunks large stores to Zarr v3 with ShardingCodec;
  • preserves array data, group hierarchy, and metadata, including OME-Zarr multiscales;
  • selects upload_folder or upload_large_folder depending on size and strategy;
  • shows progress during counting, rechunking, copying, and upload.
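The API selection step can be sketched as a simple file-count threshold. The threshold value here is purely illustrative; the helper's actual heuristic may differ.

```python
def choose_upload_api(n_files, threshold=10_000):
    """Pick the Hub upload API by file count (sketch; threshold is hypothetical)."""
    return "upload_large_folder" if n_files > threshold else "upload_folder"

print(choose_upload_api(500))      # → upload_folder
print(choose_upload_api(123_968))  # → upload_large_folder
```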

The important part is that sharding reduces the number of physical files without turning the store into an opaque archive. Instead of storing one file per chunk, Zarr v3 sharding groups many logical chunks into fewer larger shard files:

before sharding:
  chunk_000001 -> file
  chunk_000002 -> file
  chunk_000003 -> file
  ...

after sharding:
  shard_000001 -> contains many chunks
  shard_000002 -> contains many chunks
  shard_000003 -> contains many chunks
  ...

This can turn a store with hundreds of thousands of chunk files into a much smaller set of shard files while preserving lazy Zarr access. The store remains readable as Zarr, and users can still request slices, patches, multiscale levels, or thumbnails without downloading the whole dataset.
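As a rough illustration, the same ceiling arithmetic applies per shard. The shard shape below is hypothetical; the shape the helper actually chooses is not specified here.

```python
import math

def stored_file_count(array_shape, unit_shape):
    """One stored object per unit: a chunk file, or a shard file after sharding."""
    return math.prod(math.ceil(s / u) for s, u in zip(array_shape, unit_shape))

shape = (1937, 2048, 2048)
print(stored_file_count(shape, (1, 256, 256)))     # → 123968 chunk files
print(stored_file_count(shape, (64, 2048, 2048)))  # → 31 shard files (hypothetical shard shape)
```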

There is a tradeoff: reading a very small region may touch a larger shard that contains neighboring chunks, so extremely tiny random reads can fetch more bytes than they would in a one-file-per-chunk layout. This is the usual read-amplification tradeoff for sharded storage. For many ML workflows, especially patch-based training and region-based inference, this is acceptable because reads often touch multiple nearby chunks anyway, while the reduction in file count makes Hub upload and repository management practical.

The goal is not to hide all storage decisions from users, but to provide a sensible default path for making large Zarr stores usable on the Hub without tarring them into opaque blobs that cannot be streamed chunk-by-chunk.

7. Training and inference helpers

Adds patch-oriented helpers for array training workflows:

  • iter_patches(...) for deterministic patch iteration;
  • random_patch(...) for random patch sampling;
  • ZarrCollator for PyTorch-style data loading.

ZarrCollator can extract full arrays or random patches from decoded Zarr examples and return batched NumPy arrays or Torch tensors, depending on whether PyTorch is installed. For OME-Zarr inputs, it can select a multiscale level before extracting patches.
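The deterministic raster-order traversal behind iter_patches(...) can be sketched with stdlib tools. This is illustrative only; the PR's actual stride and boundary handling may differ.

```python
from itertools import product

def iter_patch_origins(shape, patch_shape, stride=None):
    """Yield top-left origins of patches in deterministic raster order (sketch)."""
    stride = stride or patch_shape  # non-overlapping patches by default
    axes = [range(0, s - p + 1, st) for s, p, st in zip(shape, patch_shape, stride)]
    yield from product(*axes)

# Non-overlapping 2x2 patches over a 4x4 array:
print(list(iter_patch_origins((4, 4), (2, 2))))
# → [(0, 0), (0, 2), (2, 0), (2, 2)]
```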

Recommended Hub Workflow

For production-scale Hub usage, the recommended pattern is to separate heavy array storage from lightweight dataset indexing:

  1. Upload raw Zarr stores to a data repository.
  2. Create a small index dataset with one row per sample and a path column.
  3. Load the index with load_dataset(..., streaming=True).
  4. Cast the path column to Zarr.
  5. Access slices, levels, thumbnails, or patches lazily from each row.

Example:

from datasets import Zarr, load_dataset

stream_ds = load_dataset("username/my-zarr-index", split="train", streaming=True)
stream_ds = stream_ds.cast_column("scan", Zarr())

sample = next(iter(stream_ds))
proxy = sample["scan"]

print(proxy.shape)
print(proxy.dtype)

patch = proxy.random_patch((1, 128, 128), level=0)
thumbnail = proxy.thumbnail(level=-1)

The index dataset can contain labels, sample ids, split information, patient or experiment metadata, annotations, or paths to related masks. The heavy Zarr store remains external and lazy. This keeps the standard datasets streaming UX while avoiding full-store downloads and avoiding dataset rows that contain array bytes.

Local Workflow

For local directories, users can either construct a dataset directly:

from datasets import Dataset, Features, Zarr

ds = Dataset.from_dict(
    {"scan": ["/path/to/sample.zarr"]},
    features=Features({"scan": Zarr()}),
)

proxy = ds[0]["scan"]
print(proxy.shape)

or use the fast Zarr directory discovery helper:

from datasets.utils.zarr_utils import load_zarr_dataset

ds = load_zarr_dataset("/path/to/zarr_dataset")
proxy = ds[0]["zarr"]

load_dataset("zarrfolder", data_dir=...) is also supported, but for very large local stores load_zarr_dataset(...) is preferred because it avoids enumerating every internal chunk file.

Non-Goals and Constraints

This PR is intentionally scoped to Zarr and OME-Zarr. It does not implement Lance, Vortex, or Iceberg support from #7863.

This PR also does not make every raw Zarr-only Hub repository automatically behave like a fully indexed datasets dataset. For the standard load_dataset(..., streaming=True) experience, an index or manifest dataset is recommended. This gives users explicit control over sample boundaries, labels, metadata, splits, and which Zarr stores are read.

For extremely heavy remote operations, such as repeatedly generating full-resolution projections over very large volumes, users still need to choose conservative read strategies. The feature enables lazy chunk reads, but it does not make expensive access patterns cheap.

Finally, zarrfolder supports direct folder discovery, but generic remote discovery can still involve file-level listing. For large Hub workflows, the index-repo pattern is the intended scalable path.

Validation

This PR adds Zarr-focused test coverage for:

  • Zarr feature encode/decode behavior;
  • lazy proxy behavior for arrays, groups, and OME-Zarr stores;
  • OME-NGFF v0.4 metadata;
  • OME-NGFF v0.5 metadata;
  • patch extraction with iter_patches(...) and random_patch(...);
  • thumbnail(...) and multiscale level access;
  • hf:// store opening and token-aware path handling;
  • zarrfolder discovery, labels, and metadata integration;
  • file counting, shard shape calculation, rechunking, and upload helper behavior;
  • ZarrCollator behavior for arrays, OME-Zarr inputs, labels, and unsupported groups;
  • local fast loading with load_zarr_dataset(...).

Current Zarr-related test status:

  • 114 tests passing.

Summary of User-Facing APIs

New or updated user-facing entry points include:

  • datasets.Zarr
  • load_dataset("zarrfolder", ...)
  • datasets.utils.zarr_utils.load_zarr_dataset
  • datasets.utils.zarr_utils.push_to_hub_zarr
  • datasets.utils.zarr_utils.ZarrCollator
  • ZarrArrayProxy
  • ZarrGroupProxy
  • OmeZarrProxy

Together, these make Zarr stores usable as lazy, streaming dataset values while preserving the familiar datasets pattern: build or stream a dataset, cast a column to a feature type, and decode examples only when they are accessed.

Demo notebook

I added a small demo notebook showing the usage of the introduced APIs: Notebook

Also, the dataset and the index files:

The data is taken from the IDR OME-NGFF samples, id: ExpA_VIP_ASLM_on.zarr

@Elsword016 Elsword016 marked this pull request as ready for review April 16, 2026 04:35
@lhoestq
Member

lhoestq commented Apr 16, 2026

How does this PR relate to the existing one at #7983?
They could be complementary; maybe it's worth checking how we can take the best of both PRs?

@Elsword016
Author

Hi @lhoestq, thanks for your reply.

My understanding is that #7983 models a Zarr store as a dataset: arrays in the selected/root group become columns, axis 0 becomes rows, and _generate_row_ranges produces chunk-aligned row shards so streaming is efficient. It also adds _CountableBuilderMixin for accurate row counts, v2 consolidated + v3 support, .zarr root detection in resolve_pattern, and hf:// support through fsspec storage_options. That seems like the right model for row-oriented / tabular-ish Zarr stores.

My PR models a Zarr store as the value of a dataset column. Each row stays lightweight, storing only a path, and decodes lazily into a ZarrArrayProxy, ZarrGroupProxy, or OmeZarrProxy. That is the workflow I’m targeting for the bioimaging / scientific imaging community, where one sample may itself be a large N-D array or OME-Zarr multiscale image, and axis 0 is often a z, time, or channel axis rather than the dataset row dimension.

Beyond the feature this PR adds:

  • OME-NGFF v0.4 and v0.5 multiscale support (levels, thumbnail, axes, scale, channel_names). This needs nested/group-aware access, which #7983 explicitly defers.
  • zarrfolder / local discovery for folder-of-stores workflows, where each .zarr store becomes one row.
  • push_to_hub_zarr(...) with optional Zarr v3 ShardingCodec rechunking, to address the "hundreds of thousands of chunk files" Hub upload problem from #4096.
  • Patch helpers (iter_patches, random_patch) and ZarrCollator for training workflows.
  • token_per_repo_id handling for dataset rows that reference private Hub repos.

Also, #7983’s Zarr is a packaged builder used through load_dataset("zarr", ...), while this PR’s Zarr is a feature type registered in _FEATURE_TYPES.

The real overlap seems narrower: fsspec / hf:// opening, Zarr v2/v3 metadata detection, and .zarr store-root handling. Those would be worth consolidating into shared helpers so there is one place to fix hf-fsspec or Zarr metadata edge cases. I’d be happy to reuse the store-opening path from #7983 where it makes sense.

So I see the combined design as:

  • load_dataset("zarr", ...) from #7983 for “the Zarr store is the dataset”.
  • datasets.Zarr from this PR for “the Zarr store is a lazy value in a dataset row”.
  • Shared internal helpers for store resolution/opening, optional dependency handling, and fsspec/HF path support.

I’m happy to rebase on top of #7983, or if you have any other suggestions, I'm happy to work on that side as well. I hope this explanation helps.

