diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3ae44bd4efc..1d544763e53 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -57,6 +57,11 @@ If you want to add a dataset see specific instructions in the section [*How to a
    pip install -e ".[dev]"
    ```
 
+   Alternatively, with uv:
+   ```bash
+   uv pip install -e ".[dev]"
+   ```
+
    (If datasets was already installed in the virtual environment, remove it with `pip uninstall datasets` before reinstalling it in editable mode with the `-e` flag.)
 
@@ -71,7 +76,7 @@ If you want to add a dataset see specific instructions in the section [*How to a
 
 7. _(Optional)_ You can also use [`pre-commit`](https://pre-commit.com/) to format your code automatically each time run `git commit`, instead of running `make style` manually. To do this, install `pre-commit` via `pip install pre-commit` and then run `pre-commit install` in the project's root directory to set up the hooks.
 
-Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again .
+Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again.
 
 8. Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:
 
@@ -110,7 +115,7 @@ You can share your dataset on https://huggingface.co/datasets directly using you
 
 Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.
 
-If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:
+If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:
 
 * a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
 * a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
@@ -126,5 +131,5 @@ Thank you for your contribution!
 
 ## Code of conduct
 
-This project adheres to the HuggingFace [code of conduct](CODE_OF_CONDUCT.md).
-By participating, you are expected to abide by this code.
+This project adheres to the Hugging Face [code of conduct](CODE_OF_CONDUCT.md).
+By participating, you are expected to abide by this code.
\ No newline at end of file
diff --git a/README.md b/README.md
index 0c0f4e23c21..de604066331 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@
 
 🤗 Datasets is a lightweight library providing **two** main features:
 
-- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
+- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
@@ -40,7 +40,7 @@
 - Native support for audio, image and video data.
 - Enable streaming mode to save disk space and start iterating over the dataset immediately.
 
-🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library.
+🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the Hugging Face team want to deeply thank the TensorFlow Datasets team for building this amazing library.
 
 # Installation
 
@@ -60,11 +60,19 @@ pip install datasets
 conda install -c huggingface -c conda-forge datasets
 ```
 
+## With uv
+
+🤗 Datasets can be installed using uv (fastest) as follows:
+
+```bash
+uv pip install datasets
+```
+
 Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.
 
 For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation
 
-## Installation to use with Machine Learning & Data frameworks frameworks
+## Installation to use with Machine Learning & Data frameworks
 
 If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (0.4+) you should also install PyTorch, TensorFlow or JAX.
 🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.
@@ -122,7 +130,7 @@ For more details on using the library, check the quick start page in the documen
 
 # Add a new dataset to the Hub
 
-We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
+We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [Hugging Face Datasets Hub](https://huggingface.co/datasets).
 You can find:
 
 - [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
 
diff --git a/docs/README.md b/docs/README.md
index abcec636429..f23dddb9736 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,5 +1,5 @@