13 changes: 9 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,11 @@ If you want to add a dataset see specific instructions in the section [*How to a
pip install -e ".[dev]"
```

Alternatively, with uv:
```bash
uv pip install -e ".[dev]"
```

(If datasets was already installed in the virtual environment, remove
it with `pip uninstall datasets` before reinstalling it in editable
mode with the `-e` flag.)
Expand All @@ -71,7 +76,7 @@ If you want to add a dataset see specific instructions in the section [*How to a

7. _(Optional)_ You can also use [`pre-commit`](https://pre-commit.com/) to format your code automatically each time you run `git commit`, instead of running `make style` manually.
To do this, install `pre-commit` via `pip install pre-commit` and then run `pre-commit install` in the project's root directory to set up the hooks.
Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again .
Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again.


8. Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:
Expand Down Expand Up @@ -110,7 +115,7 @@ You can share your dataset on https://huggingface.co/datasets directly using you

Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.

If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:
If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
Expand All @@ -126,5 +131,5 @@ Thank you for your contribution!

## Code of conduct

This project adheres to the HuggingFace [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.
This project adheres to the Hugging Face [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.
16 changes: 12 additions & 4 deletions README.md
Expand Up @@ -20,7 +20,7 @@

🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
Expand All @@ -40,7 +40,7 @@
- Native support for audio, image and video data.
- Enable streaming mode to save disk space and start iterating over the dataset immediately.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library.
🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the Hugging Face team wants to deeply thank the TensorFlow Datasets team for building this amazing library.

# Installation

Expand All @@ -60,11 +60,19 @@ pip install datasets
conda install -c huggingface -c conda-forge datasets
```

## With uv

🤗 Datasets can also be installed with [uv](https://docs.astral.sh/uv/), a fast Python package installer:

```bash
uv pip install datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation

## Installation to use with Machine Learning & Data frameworks frameworks
## Installation to use with Machine Learning & Data frameworks

If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (0.4+) you should also install PyTorch, TensorFlow or JAX.
🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.
Expand Down Expand Up @@ -122,7 +130,7 @@ For more details on using the library, check the quick start page in the documen

# Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets).

You can find:
- [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
Expand Down
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,5 +1,5 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Copyright 2020 The Hugging Face Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
Expand Down
6 changes: 3 additions & 3 deletions docs/source/quickstart.mdx
Expand Up @@ -160,7 +160,7 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

```py
Expand Down Expand Up @@ -248,7 +248,7 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
Expand Down Expand Up @@ -355,7 +355,7 @@ Use the [`~Dataset.with_format`] function to set the dataset format to `torch` a
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

```py
Expand Down
8 changes: 4 additions & 4 deletions docs/source/stream.mdx
Expand Up @@ -60,7 +60,7 @@ This special type of dataset has its own set of processing methods shown below.

> [!TIP]
> An [`IterableDataset`] is useful for iterative jobs like training a model.
> You shouldn't use a [`IterableDataset`] for jobs that require random access to examples because you have to iterate all over it using a for loop. Getting the last example in an iterable dataset would require you to iterate over all the previous examples.
> You shouldn't use an [`IterableDataset`] for jobs that require random access to examples, because you have to iterate over it using a for loop: getting the last example in an iterable dataset requires iterating over all the previous examples.
> You can find more details in the [Dataset vs. IterableDataset guide](./about_mapstyle_vs_iterable).


Expand Down Expand Up @@ -97,7 +97,7 @@ The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`Itera

>>> dataset = load_dataset("ethz/food101")
>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64) # shard the dataset
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000) # shuffles the shards order and use a shuffle buffer when you start iterating
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000) # shuffles the shards order and uses a shuffle buffer when you start iterating
>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4) # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating
```

Expand Down Expand Up @@ -276,8 +276,8 @@ Define sampling probabilities from each of the original datasets for more contro

Around 80% of the final dataset is made of the `es_dataset`, and 20% of the `fr_dataset`.

You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples.
You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the beginning of this dataset until the stop criterion has been reached.
You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e. the dataset construction is stopped as soon as one of the datasets runs out of samples.
You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every sample in every dataset has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the beginning of this dataset until the stop criterion has been reached.
Note that if no sampling probabilities are specified, the new dataset will have `max_length_datasets * nb_dataset` samples.
There is also `stopping_strategy=all_exhausted_without_replacement` to ensure that every sample is seen exactly once.

Expand Down
2 changes: 1 addition & 1 deletion notebooks/README.md
@@ -1,5 +1,5 @@
<!---
Copyright 2023 The HuggingFace Team. All rights reserved.
Copyright 2023 The Hugging Face Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
Expand Down
8 changes: 4 additions & 4 deletions src/datasets/iterable_dataset.py
Expand Up @@ -93,11 +93,11 @@
Key = Union[int, str, tuple[int, int], "BuilderKey"]


def identity_func(x):
def identity_func(x: Any) -> Any:
return x


def _rename_columns_fn(example: dict, column_mapping: dict[str, str]):
def _rename_columns_fn(example: dict, column_mapping: dict[str, str]) -> dict:
if any(col not in example for col in column_mapping):
raise ValueError(
f"Error when renaming {list(column_mapping)} to {list(column_mapping.values())}: columns {set(column_mapping) - set(example)} are not in the dataset."
Expand Down Expand Up @@ -3338,7 +3338,7 @@ def map(
Note that the last batch may have fewer than `n` examples.
A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.

If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls.
If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls.
It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.

Args:
Expand Down Expand Up @@ -3478,7 +3478,7 @@ def filter(
"""Apply a filter function to all the elements so that the dataset only includes examples according to the filter function.
The filtering is done on-the-fly when iterating over the dataset.

If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simulatenous calls (configurable).
If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simultaneous calls (configurable).
It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.

Args:
Expand Down