Expand Up @@ -2,4 +2,46 @@
Level 5: Data Subset Aggregation
================================

TBD

When working with public data, the dataset is sometimes provided with pre-divided training,
validation, and test subsets. However, in some cases, these subsets may not follow an identical
distribution, making it difficult to perform proper model comparison or selection. In this tutorial,
we will show an example of dataset aggregation and reorganization to address this issue.
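
One rough way to see whether pre-divided subsets follow a similar distribution is to compare their per-subset label frequencies.
The sketch below is a plain-Python illustration on made-up labels; the ``label_distribution`` helper and the toy data are hypothetical and are not part of Datumaro.

.. code-block:: python

    from collections import Counter


    def label_distribution(labels):
        """Return the relative frequency of each label in a list of labels."""
        counts = Counter(labels)
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}


    # Toy subsets with made-up labels, for illustration only.
    subsets = {
        "train": ["car", "car", "person", "road"],
        "val": ["car", "road", "road", "road"],
    }

    # Print the label frequencies of each subset side by side.
    for name, labels in subsets.items():
        print(name, label_distribution(labels))

If the frequencies differ strongly between subsets, aggregating and re-splitting the data, as shown in this tutorial, is one way to even them out.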

Prepare datasets
================

As we did in :ref:`level 3 <Level 3: Data Import and Export>`, we use the Cityscapes dataset.
The Cityscapes dataset is divided into train, validation, and test subsets with 2975, 500, and
1525 samples, respectively.

A more detailed description is given :ref:`here <Cityscapes>`.
The Cityscapes dataset is available for free `download <https://www.cityscapes-dataset.com/downloads/>`_.

Aggregate subsets
=================

.. tab-set::

    .. tab-item:: Python

        .. code-block:: python

            from datumaro.components.dataset import Dataset
            from datumaro.components.hl_ops import HLOps

            # Import Cityscapes with its original train/val/test subsets.
            data_path = '/path/to/cityscapes'
            dataset = Dataset.import_from(data_path, 'cityscapes')

            # Gather all three subsets into a single "default" subset.
            aggregated = HLOps.aggregate(
                dataset, from_subsets=["train", "val", "test"], to_subset="default"
            )

        (Optional) Through the :ref:`splitter <Transform>`, we can reorganize the aggregated dataset into new subsets with respect to the number of annotations in each subset.

        .. code-block:: python

            import datumaro.plugins.splitter as splitter

            # Target ratio for each new subset; the ratios sum to 1.
            splits = [("train", 0.5), ("val", 0.2), ("test", 0.3)]
            task = splitter.SplitTask.segmentation.name

            resplitted = aggregated.transform("split", task=task, splits=splits)
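
As a quick sanity check on the split ratios used above, the following sketch computes the nominal subset sizes they imply for the 5000 Cityscapes samples.
It is plain arithmetic rather than a call into Datumaro, and the sizes the splitter actually produces may differ slightly because of rounding and stratification.

.. code-block:: python

    # Cityscapes subset sizes quoted in this tutorial.
    subset_sizes = {"train": 2975, "val": 500, "test": 1525}
    total = sum(subset_sizes.values())

    # Ratios passed to the "split" transform.
    splits = [("train", 0.5), ("val", 0.2), ("test", 0.3)]

    # Nominal target size per subset (an approximation, see the note above).
    expected = {name: round(total * ratio) for name, ratio in splits}
    print(total, expected)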
81 changes: 80 additions & 1 deletion docs/source/docs/level-up/intermediate_skills/07_data_merge.rst
Expand Up @@ -2,4 +2,83 @@
Level 7: Merge Two Heterogeneous Datasets
=========================================

TBD

Training foundation models on ever larger datasets has become increasingly popular in deep
learning. This requires collecting and preparing massive datasets for model development.
Because collecting and labeling data at that scale is challenging, consolidating scattered
datasets into a unified one is important. For instance, `Florence <https://arxiv.org/pdf/2111.11432.pdf>`_
created the massive FLOD-9M dataset by combining the MS-COCO, LVIS, OpenImages, and Objects365
datasets for training.

In this tutorial, we provide a simple example of merging two datasets; a detailed description
of the merge operation is given :ref:`here <Merge>`.
A more advanced Python example with label mapping between datasets is given
:doc:`here <../../jupyter_notebook_examples/notebooks/01_merge_multiple_datasets_for_classification>`.
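
To give an intuition for why label handling matters when merging heterogeneous datasets, here is a plain-Python sketch that builds a combined label list from two label sets and remaps label indices into it.
It is only an illustration with made-up labels; ``union_labels`` and ``remap`` are hypothetical helpers and do not reflect how Datumaro implements its merge.

.. code-block:: python

    def union_labels(labels_a, labels_b):
        """Combine two label lists, keeping first-seen order and dropping duplicates."""
        merged = []
        seen = set()
        for label in list(labels_a) + list(labels_b):
            if label not in seen:
                merged.append(label)
                seen.add(label)
        return merged


    def remap(indices, source_labels, merged_labels):
        """Translate label indices from a source label list into the merged list."""
        lookup = {label: i for i, label in enumerate(merged_labels)}
        return [lookup[source_labels[i]] for i in indices]


    # Made-up label lists standing in for two classification datasets.
    labels_a = ["forest", "river", "highway"]
    labels_b = ["river", "beach", "forest"]

    merged_labels = union_labels(labels_a, labels_b)
    print(merged_labels)
    print(remap([0, 1, 2], labels_b, merged_labels))

Labels that appear in both lists end up sharing one index, so annotations from either dataset refer to the same category after the merge.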

Prepare datasets
================

Here we download two aerial datasets, EuroSAT and UC Merced, in the simple ImageNet format:

.. code-block:: bash

    datum download get -i tfds:eurosat --format imagenet --output-dir <path/to/eurosat> -- --save-media

    datum download get -i tfds:uc_merced --format imagenet --output-dir <path/to/uc_merced> -- --save-media
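
After the downloads finish, it can be useful to check what was written.
The sketch below assumes an ImageNet-style layout with one sub-directory per label containing the image files; ``count_images_per_label`` is a hypothetical helper, and the list of image extensions is an assumption.

.. code-block:: python

    from pathlib import Path

    # Assumed set of image file extensions; adjust it to the actual data.
    IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


    def count_images_per_label(root):
        """Count image files in each label sub-directory of ``root``."""
        counts = {}
        for label_dir in sorted(Path(root).iterdir()):
            if label_dir.is_dir():
                counts[label_dir.name] = sum(
                    1
                    for f in label_dir.iterdir()
                    if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
                )
        return counts

For example, calling ``count_images_per_label('/path/to/eurosat')`` returns a dictionary that maps each label directory name to its number of image files.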

Merge datasets
==============

.. tab-set::

    .. tab-item:: CLI

        Without declaring a project, we can merge multiple datasets directly:

        .. code-block:: bash

            datum merge --merge_policy union --format imagenet --output-dir <path/to/output> <path/to/eurosat> <path/to/uc_merced> -- --save-media

        The output directory now contains the merged data together with a merge report named ``merge_report.json``.

    .. tab-item:: Python

        .. code-block:: python

            from datumaro.components.dataset import Dataset
            from datumaro.components.hl_ops import HLOps

            # Import both datasets in the ImageNet format.
            eurosat_path = '/path/to/eurosat'
            eurosat = Dataset.import_from(eurosat_path, 'imagenet')

            uc_merced_path = '/path/to/uc_merced'
            uc_merced = Dataset.import_from(uc_merced_path, 'imagenet')

            # Merge the two datasets with the "union" policy.
            merged = HLOps.merge(eurosat, uc_merced, merge_policy='union')

    .. tab-item:: ProjectCLI

        With the project-based CLI, we first create two projects and import one dataset into each:

        .. code-block:: bash

            datum project create --output-dir <path/to/project1>
            datum project import --format imagenet --project <path/to/project1> <path/to/eurosat>

            datum project create --output-dir <path/to/project2>
            datum project import --format imagenet --project <path/to/project2> <path/to/uc_merced>

        We then merge the two projects:

        .. code-block:: bash

            datum merge --merge_policy union --format imagenet --output-dir <path/to/output> <path/to/project1> <path/to/project2> -- --save-media

        As with the merge without projects, the output directory contains a merge report named ``merge_report.json``.
        Finally, we import the merged data (``<path/to/output>``) into a project.
        In this tutorial, we create a third project for this purpose.

        .. code-block:: bash

            datum project create --output-dir <path/to/project3>
            datum project import --format imagenet --project <path/to/project3> <path/to/output>
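
Both CLI routes write ``merge_report.json`` to the output directory.
Its schema is not described in this tutorial, so the sketch below only loads the file and lists its top-level keys without assuming any particular field; ``load_merge_report`` and ``top_level_keys`` are hypothetical helper names.

.. code-block:: python

    import json
    from pathlib import Path


    def load_merge_report(output_dir):
        """Load ``merge_report.json`` from a merge output directory."""
        report_path = Path(output_dir) / "merge_report.json"
        with report_path.open(encoding="utf-8") as f:
            return json.load(f)


    def top_level_keys(report):
        """Return the sorted top-level keys if the report is a JSON object."""
        return sorted(report) if isinstance(report, dict) else []

For example, ``top_level_keys(load_merge_report('/path/to/output'))`` gives a first overview of what the report contains.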
Expand Up @@ -48,7 +48,7 @@ The Python example for the usage of validator is described in `here <https://git

.. code-block:: bash

datum project import --format coco_instances -p <path/to/project> <path/to/cityscapes>
datum project import --format coco_instances -p <path/to/project> <path/to/data>

(Optional) When we import data, the change is automatically committed in the project.
This can be shown through ``log`` as
9 changes: 5 additions & 4 deletions docs/source/docs/level-up/intermediate_skills/index.rst
Expand Up @@ -23,7 +23,7 @@ Intermediate Skills
:outline:
:expand:

Level 05: Data Aggregation
Level 05: Data Subset Aggregation

:bdg-warning:`Python`

Expand All @@ -47,8 +47,9 @@ Intermediate Skills

Level 07: Dataset Merge

:bdg-warning:`Python`
:bdg-info:`CLI`
:bdg-warning:`Python`
:bdg-success:`ProjectCLI`

.. grid-item-card::

Expand All @@ -71,8 +72,8 @@ Intermediate Skills

Level 09: Data Exploration

:bdg-warning:`Python`
:bdg-info:`CLI`
:bdg-warning:`Python`

.. grid-item-card::

Expand All @@ -83,5 +84,5 @@ Intermediate Skills

Level 10: Data Generation

:bdg-warning:`Python`
:bdg-info:`CLI`
:bdg-warning:`Python`