Add Zarr / OME-Zarr Dataset Support #8135
Elsword016 wants to merge 3 commits into huggingface:main
Conversation
How does this PR relate to the existing one at #7983?
Hi @lhoestq, thanks for your reply. My understanding is that #7983 models a Zarr store as a dataset: arrays in the selected/root group become columns and axis 0 becomes rows. My PR models a Zarr store as the value of a dataset column: each row stays lightweight, storing only a path, and decodes lazily into a proxy object. Beyond the feature itself, this PR also adds upload helpers, a `zarrfolder` module, and patch-based training utilities.
The real overlap between the two PRs seems narrower than it first appears: both rely on fsspec for remote store access. So I see the two designs as complementary rather than conflicting.
I'm happy to rebase on top of #7983, or if you have any other suggestions, I'm glad to work on that side as well. I hope this explanation helps.
Add Zarr / OME-Zarr Dataset Support
TL;DR
This PR adds support for Zarr-backed datasets to `datasets`, with a focus on lazy, streaming-friendly access to large multidimensional arrays. It introduces a new `Zarr` feature, lazy proxy objects for arrays, groups, and OME-Zarr multiscale images, a `zarrfolder` packaged module, Hub-compatible `hf://` opening, upload helpers for large Zarr stores, and patch-based utilities for training workflows. I initially started working on this for my lab's microscopy and connectomics data, but I think it has reached a point where a PR makes sense, both so everyone can use it and to gather improvements and feedback. I know it's quite large, but I'm very happy to respond to any queries.

The core design is intentionally lightweight: dataset rows store only a path to a Zarr store, and array data is opened lazily only when the user accesses slices, levels, patches, or metadata.
Motivation
Zarr is a common storage format for large scientific workloads such as microscopy volumes, geospatial arrays, medical images, and connectomics data, because it stores arrays as independently readable chunks inside a hierarchical store. OME-Zarr builds on Zarr for bioimaging by standardizing multiscale image pyramids, axes, coordinate transforms, channel metadata, and labels.
Before this PR, users could not represent these datasets directly in `datasets` without writing custom loading code outside the library. This PR brings that workflow into the `datasets` abstraction instead of requiring every scientific ML project to rebuild it.

Related Issues
This PR addresses the core request from #4096, which asked for streaming support for hosted Zarr stores. Among the problems highlighted there, `datasets` did not expose a native streaming path for Zarr stores.

Since that issue was opened, the Hub and `datasets` stack has gained stronger `hf://` and fsspec integration. This PR builds on that foundation by opening Hub-backed Zarr paths through `FsspecStore.from_url(...)` and by preserving token-aware access for private dataset repositories.

This PR also addresses the Zarr part of #7863, which requests native support for cloud-native formats such as Lance, Vortex, Iceberg, and Zarr, along with finer-grained streaming control. This PR is intentionally scoped to Zarr. The recommended index/manifest pattern gives users row-level control over which stores are streamed while keeping the heavy array data separate from lightweight dataset metadata.
What This PR Adds
1. `Zarr` feature

Adds a new `Zarr` feature type that stores only a path in Arrow: `{"path": "hf://datasets/username/my-zarr-data@main/sample_001.zarr"}`.

When decoded, the feature returns a lazy `ZarrProxy`. The proxy does not load array data during dataset construction, iteration, or row decoding. It opens the underlying Zarr store only on first access, such as `shape`, `dtype`, `__getitem__`, `get_level(...)`, `thumbnail(...)`, or patch extraction.

The feature supports inputs such as:
- string paths;
- `pathlib.Path` values;
- dicts with a `path` key.

It also supports `decode=False`, matching the pattern used by other modality-aware features when users want raw storage values.

2. Lazy proxy hierarchy
The decoded `ZarrProxy` automatically resolves to the appropriate concrete proxy:

- `ZarrArrayProxy` for plain Zarr arrays;
- `ZarrGroupProxy` for non-OME Zarr groups;
- `OmeZarrProxy` for OME-Zarr multiscale image groups.

This avoids forcing users to know up front whether a path points to an array, group, or OME-Zarr hierarchy. The proxy keeps dataset examples lightweight and pickle-safe, which is important for multiprocessing data loaders and distributed training.
For plain arrays, users can access shape, dtype, chunks, attributes, slices, random patches, and strided patch iteration. For groups, users can inspect keys and access nested arrays lazily. For OME-Zarr stores, users get multiscale image helpers directly on the decoded object.
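The lazy behavior described above can be sketched in a few lines. Everything below (the class name, the opener callable, the method names) is a hypothetical stand-in for illustration, not the PR's actual implementation; only the idea mirrors the description: store a path, open the array on first access, and iterate strided patches over its shape.

```python
from itertools import product


class LazyArrayProxy:
    """Hypothetical sketch: stores only a path; opens the real array on first access."""

    def __init__(self, path, opener):
        self.path = path        # the only state stored per dataset row
        self._opener = opener   # callable that opens the underlying array
        self._arr = None

    def _open(self):
        if self._arr is None:                  # first access triggers the read
            self._arr = self._opener(self.path)
        return self._arr

    @property
    def shape(self):
        return self._open().shape

    def __getitem__(self, key):
        return self._open()[key]

    def iter_patch_slices(self, patch, stride=None):
        """Deterministic strided patch iteration (yields tuples of slices)."""
        stride = stride or patch
        shape = self.shape
        for origin in product(
            *(range(0, s - p + 1, t) for s, p, t in zip(shape, patch, stride))
        ):
            yield tuple(slice(o, o + p) for o, p in zip(origin, patch))
```

Because the proxy holds only a path and an opener, it stays cheap to construct and to ship between worker processes; the expensive open happens at most once per worker.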
3. OME-Zarr multiscale support
OME-Zarr stores are detected from multiscale metadata. The implementation supports both major metadata layouts:
- top-level `multiscales` metadata (the older layout);
- multiscale metadata nested under the `ome` namespace (the newer layout).

`OmeZarrProxy` exposes:

- `num_levels`
- `levels`
- `get_level(level)`
- `thumbnail(level=-1)`
- `axes`
- `scale`
- `channel_names`
- `iter_patches(...)`
- `random_patch(...)`

Level `0` is the highest-resolution level, and negative indices are supported for lower-resolution levels, such as `level=-1` for thumbnails or quick inspection.

4. Hub-compatible streaming through `hf://`

Zarr paths can point to Hub-backed locations such as `hf://datasets/username/my-zarr-data@main/sample_001.zarr`.
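For illustration, the helper below shows one way to pull the repository id out of such an `hf://` URL, which is what the token-aware decode path described below needs in order to look up a per-repo token. This is a hypothetical sketch, not the PR's actual code.

```python
def repo_id_from_hf_url(url: str):
    """Hypothetical: extract 'user/repo' from an hf://datasets/... URL, or None."""
    prefix = "hf://datasets/"
    if not url.startswith(prefix):
        return None
    rest = url[len(prefix):]          # e.g. "username/my-zarr-data@main/sample_001.zarr"
    parts = rest.split("/", 2)
    if len(parts) < 2:
        return None
    repo_id = "/".join(parts[:2])     # "username/my-zarr-data@main"
    return repo_id.split("@", 1)[0]   # strip the revision -> "username/my-zarr-data"
```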
The feature opens `hf://` paths through Zarr's fsspec storage support, so reads are delegated to the Zarr library and the Hub filesystem instead of being implemented as a custom downloader in `datasets`. This keeps the behavior aligned with chunked Zarr semantics: only the metadata and chunks required by the requested operation are fetched.

Private dataset repositories are supported through the existing `token_per_repo_id` decode path. The implementation extracts the repository id from `hf://datasets/...` URLs and passes the matching token through storage options.

5. Dataset creation from Zarr folders
Adds a `zarrfolder` packaged module that discovers `.zarr` stores and creates one dataset row per store. It supports:

- `.zarr` directory discovery;
- optional metadata files: `metadata.csv`, `metadata.jsonl`, or `metadata.parquet`.

For local workflows, this PR also adds `load_zarr_dataset(...)`, the fast local path. It discovers `.zarr` directories directly and stops descending into each store, avoiding a full scan of every internal chunk file. This is important for large OME-Zarr stores, where generic file discovery can be dominated by chunk enumeration.

6. Upload and storage helpers for large stores
Adds `push_to_hub_zarr(...)`, an upload helper for large Zarr stores.
This helper addresses a practical Hub problem raised in #4096: real Zarr stores often contain many chunk files. In a conventional Zarr layout, each logical chunk may be stored as a separate file. For large microscopy or volumetric arrays, that file count can grow very quickly.
For example, an array with shape `(1937, 2048, 2048)` and chunks `(1, 256, 256)` has roughly 124,000 chunk files (1937 × 8 × 8 = 123,968). This is difficult for Hub/Git-style storage even when the total byte size is reasonable, because repository operations, upload, listing, and file management all become dominated by many small files.
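The chunk-file count above follows directly from the shape and chunk sizes; a quick sketch of the arithmetic:

```python
import math

# Chunk-file count for a conventional (one file per chunk) Zarr layout.
shape = (1937, 2048, 2048)
chunks = (1, 256, 256)

# Number of chunks per axis is ceil(extent / chunk), multiplied across axes.
n_chunk_files = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
print(n_chunk_files)  # 1937 * 8 * 8 = 123968
```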
`push_to_hub_zarr(...)` provides a practical upload path for this case:

- it can rewrite the store using Zarr v3's `ShardingCodec`;
- it chooses `upload_folder` or `upload_large_folder` depending on size and strategy.

The important part is that sharding reduces the number of physical files without turning the store into an opaque archive. Instead of storing one file per chunk, Zarr v3 sharding groups many logical chunks into fewer, larger shard files.
This can turn a store with hundreds of thousands of chunk files into a much smaller set of shard files while preserving lazy Zarr access. The store remains readable as Zarr, and users can still request slices, patches, multiscale levels, or thumbnails without downloading the whole dataset.
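Continuing the example above, and assuming an illustrative shard shape of `(1, 2048, 2048)` (my own choice for this sketch, not a value taken from the PR), the file-count reduction looks like this:

```python
import math

shape = (1937, 2048, 2048)
chunks = (1, 256, 256)
shard = (1, 2048, 2048)   # illustrative shard shape: one whole z-slice per shard

# Unsharded: one file per chunk.
chunk_files = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
# Sharded: one file per shard; each shard groups many logical chunks.
shard_files = math.prod(math.ceil(s / h) for s, h in zip(shape, shard))
chunks_per_shard = math.prod(h // c for h, c in zip(shard, chunks))

print(chunk_files)        # 123968 files unsharded
print(shard_files)        # 1937 shard files
print(chunks_per_shard)   # 64 logical chunks grouped into each shard
```

The chunks stay individually addressable inside each shard, which is why lazy reads keep working after sharding.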
There is a tradeoff: reading a very small region may touch a larger shard that contains neighboring chunks, so extremely tiny random reads can fetch more bytes than they would in a one-file-per-chunk layout. This is the usual read-amplification tradeoff for sharded storage. For many ML workflows, especially patch-based training and region-based inference, this is acceptable because reads often touch multiple nearby chunks anyway, while the reduction in file count makes Hub upload and repository management practical.
The goal is not to hide all storage decisions from users, but to provide a sensible default path for making large Zarr stores usable on the Hub without tarring them into opaque blobs that cannot be streamed chunk-by-chunk.
7. Training and inference helpers
Adds patch-oriented helpers for array training workflows:
- `iter_patches(...)` for deterministic patch iteration;
- `random_patch(...)` for random patch sampling;
- `ZarrCollator` for PyTorch-style data loading.

`ZarrCollator` can extract full arrays or random patches from decoded Zarr examples and return batched NumPy arrays or Torch tensors, depending on whether PyTorch is installed. For OME-Zarr inputs, it can select a multiscale level before extracting patches.

Recommended Hub Workflow
For production-scale Hub usage, the recommended pattern is to separate heavy array storage from lightweight dataset indexing:
- keep the heavy Zarr stores in their own location, such as a dedicated Hub repo;
- maintain a lightweight index dataset of paths plus labels/metadata, loadable with `load_dataset(..., streaming=True)`;
- cast the path column to `Zarr`.
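As an illustrative sketch of such an index (the column names here are my own, not mandated by the PR), a tiny manifest can be built with the standard library:

```python
import csv
import io

# Hypothetical index rows: lightweight metadata plus a path to the heavy store.
rows = [
    {"path": "hf://datasets/username/my-zarr-data@main/sample_001.zarr",
     "label": "control", "split": "train"},
    {"path": "hf://datasets/username/my-zarr-data@main/sample_002.zarr",
     "label": "treated", "split": "test"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["path", "label", "split"])
writer.writeheader()
writer.writerows(rows)
manifest_csv = buf.getvalue()
print(manifest_csv)
```

Pushed to a Hub dataset repo, such a manifest streams with the normal `load_dataset(..., streaming=True)` path, and per the PR's description, casting its `path` column to the new `Zarr` feature yields lazy proxies at decode time.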
The index dataset can contain labels, sample ids, split information, patient or experiment metadata, annotations, or paths to related masks. The heavy Zarr store remains external and lazy. This keeps the standard `datasets` streaming UX while avoiding full-store downloads and dataset rows that contain array bytes.

Local Workflow
For local directories, users can either construct a dataset directly (for example, a dataset of paths with the column cast to `Zarr`), or use the fast Zarr directory discovery helper `load_zarr_dataset(...)`.
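The pruned-walk idea behind fast `.zarr` discovery can be sketched as follows. This is a hypothetical illustration using the standard library's `os.walk`, not the PR's implementation:

```python
import os


def discover_zarr_stores(root):
    """Find .zarr directories without descending into their chunk files."""
    stores = []
    for dirpath, dirnames, _filenames in os.walk(root):
        keep = []
        for name in dirnames:
            if name.endswith(".zarr"):
                stores.append(os.path.join(dirpath, name))
                # Not kept in `dirnames`, so os.walk never descends into the
                # store and its (possibly huge) set of chunk files.
            else:
                keep.append(name)
        dirnames[:] = keep   # in-place pruning controls os.walk's descent
    return sorted(stores)
```

Pruning `dirnames` in place is what makes this cheap: the walk touches only the directory tree around the stores, never the chunk files inside them.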
`load_dataset("zarrfolder", data_dir=...)` is also supported, but for very large local stores `load_zarr_dataset(...)` is preferred because it avoids enumerating every internal chunk file.

Non-Goals and Constraints
This PR is intentionally scoped to Zarr and OME-Zarr. It does not implement Lance, Vortex, or Iceberg support from #7863.
This PR also does not make every raw Zarr-only Hub repository automatically behave like a fully indexed `datasets` dataset. For the standard `load_dataset(..., streaming=True)` experience, an index or manifest dataset is recommended. This gives users explicit control over sample boundaries, labels, metadata, splits, and which Zarr stores are read.

For extremely heavy remote operations, such as repeatedly generating full-resolution projections over very large volumes, users still need to choose conservative read strategies. The feature enables lazy chunk reads, but it does not make expensive access patterns cheap.
Finally, `zarrfolder` supports direct folder discovery, but generic remote discovery can still involve file-level listing. For large Hub workflows, the index-repo pattern is the intended scalable path.

Validation
This PR adds Zarr-focused test coverage for:
- `Zarr` feature encode/decode behavior;
- `iter_patches(...)` and `random_patch(...)`;
- `thumbnail(...)` and multiscale level access;
- `hf://` store opening and token-aware path handling;
- `zarrfolder` discovery, labels, and metadata integration;
- `ZarrCollator` behavior for arrays, OME-Zarr inputs, labels, and unsupported groups;
- `load_zarr_dataset(...)`.
Summary of User-Facing APIs
New or updated user-facing entry points include:

- `datasets.Zarr`
- `load_dataset("zarrfolder", ...)`
- `datasets.utils.zarr_utils.load_zarr_dataset`
- `datasets.utils.zarr_utils.push_to_hub_zarr`
- `datasets.utils.zarr_utils.ZarrCollator`
- `ZarrArrayProxy`
- `ZarrGroupProxy`
- `OmeZarrProxy`

Together, these make Zarr stores usable as lazy, streaming dataset values while preserving the familiar `datasets` pattern: build or stream a dataset, cast a column to a feature type, and decode examples only when they are accessed.

Demo notebook
I added a small demo notebook showing the usage of the introduced APIs: Notebook
Also, the dataset and the index files:
The data is taken from the IDR OME-NGFF samples (id: ExpA_VIP_ASLM_on.zarr).