13 changes: 9 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,11 @@ If you want to add a dataset see specific instructions in the section [*How to a
pip install -e ".[dev]"
```

Alternatively, with uv:
```bash
uv pip install -e ".[dev]"
```

(If datasets was already installed in the virtual environment, remove
it with `pip uninstall datasets` before reinstalling it in editable
mode with the `-e` flag.)
Expand All @@ -71,7 +76,7 @@ If you want to add a dataset see specific instructions in the section [*How to a

7. _(Optional)_ You can also use [`pre-commit`](https://pre-commit.com/) to format your code automatically each time you run `git commit`, instead of running `make style` manually.
To do this, install `pre-commit` via `pip install pre-commit` and then run `pre-commit install` in the project's root directory to set up the hooks.
Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again .
Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again.


8. Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:
Expand Down Expand Up @@ -110,7 +115,7 @@ You can share your dataset on https://huggingface.co/datasets directly using you

Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.

If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:
If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
Expand All @@ -126,5 +131,5 @@ Thank you for your contribution!

## Code of conduct

This project adheres to the HuggingFace [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.
This project adheres to the Hugging Face [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.
16 changes: 12 additions & 4 deletions README.md
Expand Up @@ -20,7 +20,7 @@

🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
Expand All @@ -40,7 +40,7 @@
- Native support for audio, image and video data.
- Enable streaming mode to save disk space and start iterating over the dataset immediately.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library.
🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the Hugging Face team wants to deeply thank the TensorFlow Datasets team for building this amazing library.

# Installation

Expand All @@ -60,11 +60,19 @@ pip install datasets
conda install -c huggingface -c conda-forge datasets
```

## With uv

🤗 Datasets can also be installed with [uv](https://docs.astral.sh/uv/), a fast Python package installer:

```bash
uv pip install datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation

## Installation to use with Machine Learning & Data frameworks frameworks
## Installation to use with Machine Learning & Data frameworks

If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (0.4+) you should also install PyTorch, TensorFlow or JAX.
🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.
Expand Down Expand Up @@ -122,7 +130,7 @@ For more details on using the library, check the quick start page in the documen

# Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets).

You can find:
- [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
Expand Down
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,5 +1,5 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Copyright 2020 The Hugging Face Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
Expand Down
6 changes: 3 additions & 3 deletions docs/source/quickstart.mdx
Expand Up @@ -160,7 +160,7 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

```py
Expand Down Expand Up @@ -248,7 +248,7 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
Expand Down Expand Up @@ -355,7 +355,7 @@ Use the [`~Dataset.with_format`] function to set the dataset format to `torch` a
<tf>

Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
TensorFlow, and ready to train/fine-tune a model, as it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.

```py
Expand Down
8 changes: 4 additions & 4 deletions docs/source/stream.mdx
Expand Up @@ -60,7 +60,7 @@ This special type of dataset has its own set of processing methods shown below.

> [!TIP]
> An [`IterableDataset`] is useful for iterative jobs like training a model.
> You shouldn't use a [`IterableDataset`] for jobs that require random access to examples because you have to iterate all over it using a for loop. Getting the last example in an iterable dataset would require you to iterate over all the previous examples.
> You shouldn't use an [`IterableDataset`] for jobs that require random access to examples, because you have to iterate over it using a for loop: getting the last example in an iterable dataset requires iterating over all the previous examples.
> You can find more details in the [Dataset vs. IterableDataset guide](./about_mapstyle_vs_iterable).


Expand Down Expand Up @@ -97,7 +97,7 @@ The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`Itera

>>> dataset = load_dataset("ethz/food101")
>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64) # shard the dataset
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000) # shuffles the shards order and use a shuffle buffer when you start iterating
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000) # shuffles the shards order and uses a shuffle buffer when you start iterating
>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4) # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating
```

Expand Down Expand Up @@ -276,8 +276,8 @@ Define sampling probabilities from each of the original datasets for more contro

Around 80% of the final dataset is made of the `es_dataset`, and 20% of the `fr_dataset`.

You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples.
You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the beginning of this dataset until the stop criterion has been reached.
You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e. the dataset construction is stopped as soon as one of the datasets runs out of samples.
You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every sample in every dataset has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the beginning of this dataset until the stop criterion has been reached.
Note that if no sampling probabilities are specified, the new dataset will have `max_length_datasets * nb_dataset` samples.
There is also `stopping_strategy=all_exhausted_without_replacement` to ensure that every sample is seen exactly once.

Expand Down
2 changes: 1 addition & 1 deletion notebooks/README.md
@@ -1,5 +1,5 @@
<!---
Copyright 2023 The HuggingFace Team. All rights reserved.
Copyright 2023 The Hugging Face Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
Expand Down
8 changes: 4 additions & 4 deletions src/datasets/iterable_dataset.py
Expand Up @@ -93,11 +93,11 @@
Key = Union[int, str, tuple[int, int], "BuilderKey"]


def identity_func(x):
def identity_func(x: Any) -> Any:
return x


def _rename_columns_fn(example: dict, column_mapping: dict[str, str]):
def _rename_columns_fn(example: dict, column_mapping: dict[str, str]) -> dict:
if any(col not in example for col in column_mapping):
raise ValueError(
f"Error when renaming {list(column_mapping)} to {list(column_mapping.values())}: columns {set(column_mapping) - set(example)} are not in the dataset."
Expand Down Expand Up @@ -3338,7 +3338,7 @@ def map(
Note that the last batch may have fewer than `n` examples.
A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.

If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls.
If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls.
It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.

Args:
Expand Down Expand Up @@ -3478,7 +3478,7 @@ def filter(
"""Apply a filter function to all the elements so that the dataset only includes examples according to the filter function.
The filtering is done on-the-fly when iterating over the dataset.

If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simulatenous calls (configurable).
If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simultaneous calls (configurable).
It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.

Args:
Expand Down