Merged
4 changes: 4 additions & 0 deletions docs/source/docs/data-formats/formats/index.rst
Original file line number Diff line number Diff line change
@@ -43,6 +43,7 @@ Supported Data Formats
segment_anything
sly_pointcloud
synthia
tabular
vgg_face2
video
vott_csv
@@ -193,6 +194,9 @@ Supported Data Formats
* `Format specification <https://synthia-dataset.net/>`_
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/synthia_dataset>`_
* `Format documentation <synthia.md>`_
* Tabular (``classification``, ``regression``) (import/export only)
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/tabular_dataset/adopt-a-buddy>`_
* `Format documentation <tabular.md>`_
* TF Detection API (``bboxes``, ``masks``)
* Format specifications: `[bboxes] <https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md>`_, `[masks] <https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/instance_segmentation.md>`_
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/tf_detection_api_dataset>`_
122 changes: 122 additions & 0 deletions docs/source/docs/data-formats/formats/tabular.md
@@ -0,0 +1,122 @@
# Tabular

## Format specification

A tabular dataset generally refers to data organized as a table with multiple rows and columns. <br>
`.csv` files are the most common format, and OpenML uses `.arff` as its official format.

Datumaro supports tabular data only in `.csv` format, where the first row must be a header with unique column names.
This is because other formats can easily be converted to `.csv`, as shown below.

```python
# convert '.arff' to '.csv'
from scipy.io.arff import loadarff
import pandas as pd
data = loadarff("dataset.arff")
df = pd.DataFrame(data[0])
# decode byte-string (nominal) columns to regular strings
categorical = [col for col in df.columns if df[col].dtype == "O"]
df[categorical] = df[categorical].apply(lambda x: x.str.decode("utf-8"))
df.to_csv("arff.csv", index=False)

# convert '.parquet', '.feather', '.hdf5', '.pickle' to '.csv'.
pd.read_parquet("dataset.parquet").to_csv('parquet.csv', index=False)
pd.read_feather("dataset.feather").to_csv('feather.csv', index=False)
pd.read_hdf("dataset.hdf5").to_csv('hdf5.csv', index=False)
pd.read_pickle("dataset.pickle").to_csv('pickle.csv', index=False)

# convert '.jay' to '.csv'
import datatable as dt
data = dt.fread("dataset.jay")
data.to_csv("jay.csv")
```
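Before importing, it can be worth checking that a file meets the header requirement above (first row is a header with unique column names). A minimal sketch using only the standard library; the helper name is illustrative, not part of Datumaro:

```python
import csv
import io

def has_valid_header(csv_text: str) -> bool:
    """Check that the first row is a header with unique, non-empty column names."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return bool(header) and all(header) and len(header) == len(set(header))

print(has_valid_header("date,day,class\n0.42,5,UP\n"))  # True
print(has_valid_header("date,day,date\n0.42,5,UP\n"))   # False: duplicate column name
```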

A tabular dataset can be one of the following:
- a single file with a `.csv` extension
- a directory containing `.csv` files (only one directory level is supported).
<!--lint disable fenced-code-flag-->
```
dataset/
├── aaa.csv
├── ...
└── zzz.csv
```
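The one-level layout rule above can be mirrored with a small sketch for gathering candidate files; this is an illustration of the rule, not the importer's actual code:

```python
import glob
import os.path as osp

def collect_csv_files(path: str) -> list:
    """Gather .csv files for a tabular dataset: a single file, or one directory level."""
    if osp.isfile(path):
        return [path] if path.endswith(".csv") else []
    # only the top level of the directory is scanned (1 depth, no recursion)
    return sorted(glob.glob(osp.join(path, "*.csv")))
```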

Supported annotation types:
- `Tabular`

## Import tabular dataset

A Datumaro project with a tabular source can be created in the following way:

```bash
datum project create
datum project import --format tabular <path/to/dataset>
```

It is also possible to import the dataset using Python API:

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/dataset>', 'tabular')
```

Datumaro stores the imported table as media (a list of `TableRow`) and annotates the target columns.
By default, the last column is regarded as the target column;
the user can also specify the target column(s) explicitly when importing the dataset, as shown below.

```bash
datum project create
datum project import --format tabular <path/to/buddy/dataset> -- --target breed_category,pet_category
datum project import --format tabular <path/to/electricity/dataset> -- --target class
```

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/buddy/dataset>', 'tabular', target=["breed_category", "pet_category"])
dataset = dm.Dataset.import_from('<path/to/electricity/dataset>', 'tabular', target="class")
```
**Contributor:**

How about showing an example of how a table is displayed, like `df.head()` in pandas?

|   date   | day | period   | nswprice | nswdemand | vicprice | vicdemand | transfer | class |
|----------|-----|----------|----------|-----------|----------|-----------|----------|-------|
| 0.425556 |  5  | 0.340426 | 0.076108 |  0.392889 | 0.003467 |  0.422915 | 0.414912 |   UP  |

If this is not appropriate, please ignore it.

**Author:**

That's a good idea; I added examples at the end.


As shown, the target can be a single column name or a comma-separated list of columns.
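The default rule (last column as target) can be illustrated with plain pandas; this is a sketch of the rule itself, not the importer's internals:

```python
import pandas as pd

df = pd.DataFrame(
    {"nswprice": [0.076, 0.060], "nswdemand": [0.393, 0.483], "class": ["UP", "DOWN"]}
)
# When no target is given, the last column is assumed to be the target.
targets = [df.columns[-1]]
features = [c for c in df.columns if c not in targets]
print(targets, features)  # ['class'] ['nswprice', 'nswdemand']
```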

Note that each tabular file is treated as a separate subset.
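A sketch of how file names could map to subset names; naming the subset after the file stem is an assumption for illustration, not a documented guarantee:

```python
from pathlib import Path

def subset_name(csv_path: str) -> str:
    # assumed: the subset is named after the file, e.g. 'train.csv' -> 'train'
    return Path(csv_path).stem

print(subset_name("dataset/train.csv"))  # train
```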

## Export tabular dataset

Datumaro supports exporting a tabular dataset using the CLI or the Python API.
Each subset will be saved to a separate `.csv` file.

```bash
datum project create
datum project import -f tabular <path/to/dataset>
datum project export -f tabular -o <output/dir>
```

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/dataset>', 'tabular')
dataset.export('<path/to/output/dir>', 'tabular')
```
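Since each subset is written as a plain `.csv`, the export side can be sanity-checked with pandas alone; a sketch under the assumption that the header and values survive a write/read round trip unchanged:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2], "class": ["UP", "DOWN"]})
out = os.path.join(tempfile.mkdtemp(), "train.csv")
df.to_csv(out, index=False)  # one .csv per subset, header row included
restored = pd.read_csv(out)
print(restored.equals(df))  # True
```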

Note that converting a tabular dataset into other formats and vice versa is not supported.

## Examples
Examples of using this format from the code can be found in
[the format tests](https://github.com/openvinotoolkit/datumaro/blob/develop/tests/unit/test_tabular_format.py).
The datasets used there are randomly sampled; the full datasets can be found below.
- Electricity Dataset: <https://www.openml.org/d/44156>

| date | day | period | nswprice | nswdemand | vicprice | vicdemand | transfer | class |
|---------:|------:|---------:|-----------:|------------:|-----------:|------------:|-----------:|:--------|
| 0.425556 | 5 | 0.340426 | 0.076108 | 0.392889 | 0.003467 | 0.422915 | 0.414912 | UP |
| 0.425512 | 4 | 0.617021 | 0.060376 | 0.483041 | 0.003467 | 0.422915 | 0.414912 | DOWN |
| 0.013982 | 4 | 0.042553 | 0.061967 | 0.521125 | 0.003467 | 0.422915 | 0.414912 | DOWN |
| 0.907349 | 3 | 0.06383 | 0.080581 | 0.331003 | 0.00538 | 0.47566 | 0.441228 | DOWN |
| 0.889341 | 0 | 0.361702 | 0.027141 | 0.379649 | 0.001624 | 0.248317 | 0.69386 | DOWN |

- Buddy Dataset: <https://www.kaggle.com/datasets/akash14/adopt-a-buddy>

| pet_id | issue_date | listing_date | condition | color_type | length(m) | height(cm) | X1 | X2 | breed_category | pet_category |
|:-----------|:--------------------|:--------------------|------------:|:-------------|------------:|-------------:|-----:|-----:|-----------------:|---------------:|
| ANSL_59957 | 2015-10-21 00:00:00 | 2016-11-12 09:00:00 | nan | Lynx Point | 0.49 | 24.53 | 16 | 9 | 2 | 1 |
| ANSL_57687 | 2016-08-25 00:00:00 | 2016-09-20 08:11:00 | nan | Red | 0.87 | 43.17 | 15 | 4 | 2 | 4 |
| ANSL_62277 | 2014-12-29 00:00:00 | 2017-01-19 14:47:00 | nan | Brown | 0.81 | 25.72 | 15 | 4 | 2 | 4 |
| ANSL_72624 | 2016-10-30 00:00:00 | 2017-02-18 14:57:00 | 1 | Brown Tabby | 0.36 | 10.18 | 0 | 1 | 0 | 1 |
| ANSL_51838 | 2014-12-29 00:00:00 | 2017-01-19 14:46:00 | nan | Brown | 0.97 | 48.7 | 15 | 4 | 2 | 4 |
16 changes: 16 additions & 0 deletions docs/source/docs/data-formats/media_formats.md
@@ -4,6 +4,7 @@ Datumaro supports the following media types:
- 2D RGB(A) images
- Videos
- KITTI Point Clouds
- Tabular file (csv format)

To create an unlabelled dataset from an arbitrary directory with images use
`image_dir` and `image_zip` formats:
@@ -75,3 +76,18 @@ Datumaro supports the following video formats:
.mp4, .mpg, .mpeg, .m2p, .ps, .ts, .m2ts, .mxf, .ogg, .ogv, .ogx,
.mov, .qt, .rmvb, .vob, .webm
```

Datumaro also supports a tabular format.
A tabular dataset can be a single `.csv` file or a folder containing `.csv` files.

```bash
cd </path/to/project>
datum project create
datum project import -f tabular </path/to/tabular>
```

```python
from datumaro import Dataset

dataset = Dataset.import_from('/path/to/tabular', 'tabular')
```
20 changes: 20 additions & 0 deletions src/datumaro/components/annotation.py
@@ -1115,6 +1115,17 @@ def add(
dtype: Type[TableDtype],
labels: Optional[Set[str]] = None,
) -> int:
"""
Add a Tabular Category.

Args:
name (str): Column name
dtype (type): Type of the corresponding column. (str, int, or float)
labels (optional, set(str)): Label values that the column can have.

Returns:
int: The index of the added category.
"""
assert name
assert name not in self._indices_by_name
assert dtype
@@ -1126,6 +1137,15 @@ def add(
return index

def find(self, name: str) -> Tuple[Optional[int], Optional[Category]]:
"""
Find Category information for the given column name.

Args:
name (str): Column name

Returns:
tuple(int, Category): The index and the Category, or (None, None) if the name is not found.
"""
index = self._indices_by_name.get(name)
return index, self.items[index] if index is not None else None

80 changes: 65 additions & 15 deletions src/datumaro/components/media.py
@@ -1206,15 +1206,13 @@ def as_dict(self) -> Dict[str, Any]:


class Table:
"""
Provides random access to the table row.
"""

def __init__(
self,
) -> None:
"""
Constructor for Table media.
Table data with multiple rows and columns.
This provides random access to the table rows.

Initialization must be done in the child class.
"""
assert self.__class__ != Table, (
@@ -1226,7 +1224,12 @@ def __init__(

@classmethod
def from_csv(cls, path: str, *args, **kwargs) -> Type[Table]:
"""Returns Table instance creating from a csv file."""
"""
Returns a Table instance created from a csv file.

Args:
path (str) : Path to csv file.
"""
return TableFromCSV(path, *args, **kwargs)

@classmethod
@@ -1236,7 +1239,12 @@ def from_dataframe(
*args,
**kwargs,
) -> Type[Table]:
"""Returns Table instance creating from a pandas DataFrame."""
"""
Returns a Table instance created from a pandas DataFrame.

Args:
data (DataFrame) : Data in pandas DataFrame format.
"""
return TableFromDataFrame(data, *args, **kwargs)

@classmethod
@@ -1246,7 +1254,12 @@ def from_list(
*args,
**kwargs,
) -> Type[Table]:
"""Returns Table instance creating from a list of dicts."""
"""
Returns a Table instance created from a list of dicts.

Args:
data (list(dict(str,str|int|float))) : A list of table row data.
"""
return TableFromListOfDict(data, *args, **kwargs)

def __eq__(self, other: object) -> bool:
@@ -1288,9 +1301,7 @@ def dtype(self, column: str) -> Optional[Type[TableDtype]]:
return type(np.zeros(1, numpy_type).tolist()[0])

def features(self, column: str, unique: Optional[bool] = False) -> List[TableDtype]:
"""
Get features for a given column name.
"""
"""Get features for a given column name."""
if unique:
return list(self.data[column].unique())
else:
@@ -1300,6 +1311,12 @@ def save(
self,
path: str,
):
"""
Save table instance to a '.csv' file.

Args:
path (str) : Path to the output csv file.
"""
data: pd.DataFrame = self.data
os.makedirs(osp.dirname(path), exist_ok=True)
data.to_csv(path, index=False)
@@ -1316,10 +1333,13 @@ def __init__(
**kwargs,
) -> None:
"""
Constructor for TableFromCSV.
@param path: Path to csv file
@param sep: Delimiter to use.
@param encoding: Encoding to use for UTF when reading/writing (ex. 'utf-8').
Read a '.csv' file and compose a Table instance.

Args:
path (str) : Path to csv file.
dtype (optional, dict(str,str)) : Dictionary mapping column name -> type string ('str', 'int', or 'float').
sep (optional, str) : Delimiter to use.
encoding (optional, str) : Encoding to use for UTF when reading/writing (ex. 'utf-8').
"""
super().__init__(path, *args, **kwargs)

@@ -1348,6 +1368,12 @@ def __init__(
*args,
**kwargs,
):
"""
Read a pandas DataFrame and compose a Table instance.

Args:
data (DataFrame) : Data in pandas DataFrame format.
"""
super().__init__(data=data, *args, **kwargs)

if data is None:
@@ -1373,13 +1399,27 @@ def __init__(
*args,
**kwargs,
):
"""
Read a list of table row data and compose a Table instance.
The table row data is in dictionary format.

Args:
data (list(dict(str,str|int|float))) : A list of table row data.
"""
super().__init__(data=pd.DataFrame(data), *args, **kwargs)


class TableRow(MediaElement):
_type = MediaType.TABLE_ROW

def __init__(self, table: Table, index: int):
"""
TableRow media refers to a Table instance and its row index.

Args:
table (Table) : Table instance.
index (int) : Row index.
"""
if table is None:
raise ValueError("'table' can't be None")
if index < 0 or index >= table.shape[0]:
@@ -1389,13 +1429,23 @@ def __init__(self, table: Table, index: int):

@property
def table(self) -> Table:
"""Table instance"""
return self._table

@property
def index(self) -> int:
"""Row index"""
return self._index

def data(self, targets: Optional[List[str]] = None) -> Dict:
"""
Row data in dict format.

Args:
targets (optional, list(str)) : If specified, only the values
    of the target columns are returned.
    Otherwise, the whole row data is returned.
"""
row = self.table.data.iloc[self.index]
if targets:
row = row[targets]