open-edge-platform · wonjuleee · Apr 20, 2023
diff --git a/docs/source/docs/level-up/intermediate_skills/05_data_aggregation.rst b/docs/source/docs/level-up/intermediate_skills/05_data_aggregation.rst
@@ -2,4 +2,46 @@
 Level 5: Data Subset Aggregation
 ================================
 
-TBD
+
+When working with public data, the dataset is sometimes provided with pre-divided training,
+validation, and test subsets. However, in some cases, these subsets may not follow an identical
+distribution, making it difficult to perform proper model comparison or selection. In this tutorial,
+we will show an example of dataset aggregation and reorganization to address this issue.
+
+Prepare datasets
+================
+
+As in :ref:`level 3 <Level 3: Data Import and Export>`, we use the Cityscapes dataset.
+The Cityscapes dataset is divided into train, validation, and test subsets containing 2975,
+500, and 1525 samples, respectively.
+
+A more detailed description is given :ref:`here <Cityscapes>`, and the dataset is available for free `download <https://www.cityscapes-dataset.com/downloads/>`_.
+
+Aggregate subsets
+=================
+
+.. tab-set::
+
+  .. tab-item:: Python
+
+    .. code-block:: python
+
+        from datumaro.components.dataset import Dataset
+
+        data_path = '/path/to/cityscapes'
+        dataset = Dataset.import_from(data_path, 'cityscapes')
+
+        from datumaro.components.hl_ops import HLOps
+
+        aggregated = HLOps.aggregate(dataset, from_subsets=["train", "val", "test"], to_subset="default")
+
+    (Optional) Using the :ref:`splitter <Transform>`, we can re-split the aggregated dataset into new subsets while taking the number of annotations in each subset into account.
+
+    .. code-block:: python
+
+      import datumaro.plugins.splitter as splitter
+
+      splits = [("train", 0.5), ("val", 0.2), ("test", 0.3)]
+      task = splitter.SplitTask.segmentation.name
+
+      resplitted = aggregated.transform("split", task=task, splits=splits)
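The aggregate-then-resplit flow in this tutorial can be sketched in plain Python. This is only an illustration of the idea, not Datumaro's implementation: `aggregate_subsets` and `split_by_ratio` are hypothetical helper names, items are modeled as `(item_id, subset)` tuples, and the split below is a plain random split that ignores annotations, unlike the task-aware splitter referenced in the tutorial.

```python
import random


def aggregate_subsets(items, from_subsets, to_subset="default"):
    """Relabel every item whose subset is listed in from_subsets as to_subset."""
    sources = set(from_subsets)
    return [
        (item_id, to_subset if subset in sources else subset)
        for item_id, subset in items
    ]


def split_by_ratio(items, splits, seed=0):
    """Shuffle items and reassign subsets according to (name, ratio) pairs.

    Hypothetical helper: a purely random split, not annotation-aware.
    """
    if abs(sum(ratio for _, ratio in splits) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    ids = [item_id for item_id, _ in items]
    random.Random(seed).shuffle(ids)

    result, start, cumulative = [], 0, 0.0
    for index, (name, ratio) in enumerate(splits):
        cumulative += ratio
        # The last subset takes whatever remains, so no item is dropped.
        is_last = index == len(splits) - 1
        end = len(ids) if is_last else round(cumulative * len(ids))
        result.extend((item_id, name) for item_id in ids[start:end])
        start = end
    return result


# Toy stand-in for the Cityscapes subset sizes quoted in the tutorial.
items = (
    [(f"train/{i}", "train") for i in range(2975)]
    + [(f"val/{i}", "val") for i in range(500)]
    + [(f"test/{i}", "test") for i in range(1525)]
)
aggregated = aggregate_subsets(items, ["train", "val", "test"])
resplit = split_by_ratio(aggregated, [("train", 0.5), ("val", 0.2), ("test", 0.3)])

counts = {}
for _, subset in resplit:
    counts[subset] = counts.get(subset, 0) + 1
print(counts)  # {'train': 2500, 'val': 1000, 'test': 1500}
```

With 5000 items in total, the 0.5/0.2/0.3 ratios yield 2500, 1000, and 1500 items, so all three new subsets are drawn from the same pooled distribution.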
diff --git a/docs/source/docs/level-up/intermediate_skills/07_data_merge.rst b/docs/source/docs/level-up/intermediate_skills/07_data_merge.rst
@@ -2,4 +2,83 @@
 Level 7: Merge Two Heterogeneous Datasets
 =========================================
 
-TBD
+
+Training foundation models on ever larger datasets has become increasingly popular in deep
+learning. To achieve this, it is crucial to collect and prepare massive datasets for model
+development. Collecting and labeling large datasets can be challenging, so consolidating
+scattered datasets into a unified one is important. For instance, `Florence <https://arxiv.org/pdf/2111.11432.pdf>`_
+created the massive FLOD-9M dataset for training by combining the MS-COCO, LVIS, OpenImages,
+and Object365 datasets.
+
+In this tutorial, we provide a simple example of merging two datasets. A detailed description
+of the merge operation is given :ref:`here <Merge>`.
+A more advanced Python example with label mapping between datasets is given
+:doc:`here <../../jupyter_notebook_examples/notebooks/01_merge_multiple_datasets_for_classification>`.
+
+Prepare datasets
+================
+
+Here we download two aerial datasets, EuroSAT and UC Merced, in the simple ImageNet format:
+
+.. code-block:: bash
+
+  datum download get -i tfds:eurosat --format imagenet --output-dir <path/to/eurosat> -- --save-media
+
+  datum download get -i tfds:uc_merced --format imagenet --output-dir <path/to/uc_merced> -- --save-media
+
+Merge datasets
+==============
+
+.. tab-set::
+
+  .. tab-item:: CLI
+
+    Without declaring a project, we can merge multiple datasets directly:
+
+    .. code-block:: bash
+
+      datum merge --merge_policy union --format imagenet --output-dir <path/to/output> <path/to/eurosat> <path/to/uc_merced> -- --save-media
+
+    The output directory now contains the merged data along with a merge report named ``merge_report.json``.
+
+  .. tab-item:: Python
+
+    .. code-block:: python
+
+        from datumaro.components.dataset import Dataset
+
+        eurosat_path = '/path/to/eurosat'
+        eurosat = Dataset.import_from(eurosat_path, 'imagenet')
+
+        uc_merced_path = '/path/to/uc_merced'
+        uc_merced = Dataset.import_from(uc_merced_path, 'imagenet')
+
+        from datumaro.components.hl_ops import HLOps
+
+        merged = HLOps.merge(eurosat, uc_merced, merge_policy='union')
+
+  .. tab-item:: ProjectCLI
+
+    With the project-based CLI, we first create two projects and import a dataset into each:
+
+    .. code-block:: bash
+
+      datum project create --output-dir <path/to/project1>
+      datum project import --format imagenet --project <path/to/project1> <path/to/eurosat>
+
+      datum project create --output-dir <path/to/project2>
+      datum project import --format imagenet --project <path/to/project2> <path/to/uc_merced>
+
+    We then merge the two projects:
+
+    .. code-block:: bash
+
+      datum merge --merge_policy union --format imagenet --output-dir <path/to/output> <path/to/project1> <path/to/project2> -- --save-media
+
+    As in the merge without projects, a merge report named ``merge_report.json`` is written to the output directory.
+    Finally, we create another project and import the merged data (``<path/to/output>``) into it.
+
+    .. code-block:: bash
+
+      datum project create --output-dir <path/to/project3>
+      datum project import --format imagenet --project <path/to/project3> <path/to/output>
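To make the idea behind a union-style merge of two classification datasets concrete, here is a minimal plain-Python sketch. It is an assumption about what a union of label categories looks like in principle, not Datumaro's merge implementation: `union_labels` and `remap_items` are hypothetical helpers, items are modeled as `(item_id, label_index)` tuples, and the label lists are toy examples rather than the real EuroSAT or UC Merced categories.

```python
def union_labels(*label_lists):
    """Return the union of several label lists, keeping first-seen order."""
    merged = []
    for labels in label_lists:
        for name in labels:
            if name not in merged:
                merged.append(name)
    return merged


def remap_items(items, source_labels, merged_labels):
    """Rewrite each item's label index so it points into merged_labels."""
    index_of = {name: i for i, name in enumerate(merged_labels)}
    return [
        (item_id, index_of[source_labels[label_index]])
        for item_id, label_index in items
    ]


# Toy label lists and items (illustrative only).
labels_a = ["forest", "river", "highway"]
labels_b = ["forest", "beach", "river"]
items_a = [("a/0", 0), ("a/1", 2)]
items_b = [("b/0", 1), ("b/1", 2)]

merged_labels = union_labels(labels_a, labels_b)
merged_items = (
    remap_items(items_a, labels_a, merged_labels)
    + remap_items(items_b, labels_b, merged_labels)
)
print(merged_labels)  # ['forest', 'river', 'highway', 'beach']
print(merged_items)   # [('a/0', 0), ('a/1', 2), ('b/0', 3), ('b/1', 1)]
```

Labels shared by both sources ("forest", "river") collapse into a single category, while labels unique to one source are appended. That is why item indices from the second dataset need remapping after the merge.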
diff --git a/docs/source/docs/level-up/intermediate_skills/08_data_validate.rst b/docs/source/docs/level-up/intermediate_skills/08_data_validate.rst
@@ -48,7 +48,7 @@ The Python example for the usage of validator is described in `here <https://git
 
     .. code-block:: bash
 
-      datum project import --format coco_instances -p <path/to/project> <path/to/cityscapes>
+      datum project import --format coco_instances -p <path/to/project> <path/to/data>
 
     (Optional) When we import a data, the change is automatically commited in the project.
     This can be shown through ``log`` as

diff --git a/docs/source/docs/level-up/intermediate_skills/index.rst b/docs/source/docs/level-up/intermediate_skills/index.rst
@@ -23,7 +23,7 @@ Intermediate Skills
          :outline:
          :expand:
 
-         Level 05: Data Aggregation
+         Level 05: Data Subset Aggregation
 
       :bdg-warning:`Python`
 
@@ -47,8 +47,9 @@ Intermediate Skills
 
          Level 07: Dataset Merge
 
-      :bdg-warning:`Python`
       :bdg-info:`CLI`
+      :bdg-warning:`Python`
+      :bdg-success:`ProjectCLI`
 
    .. grid-item-card::
 
@@ -71,8 +72,8 @@ Intermediate Skills
 
          Level 09: Data Exploration
 
-      :bdg-warning:`Python`
       :bdg-info:`CLI`
+      :bdg-warning:`Python`
 
    .. grid-item-card::
 
@@ -83,5 +84,5 @@ Intermediate Skills
 
          Level 10: Data Generation
 
-      :bdg-warning:`Python`
       :bdg-info:`CLI`
+      :bdg-warning:`Python`