Merged
4 changes: 4 additions & 0 deletions docs/source/docs/data-formats/formats/index.rst
Original file line number Diff line number Diff line change
@@ -43,6 +43,7 @@ Supported Data Formats
segment_anything
sly_pointcloud
synthia
tabular
vgg_face2
video
vott_csv
@@ -193,6 +194,9 @@ Supported Data Formats
* `Format specification <https://synthia-dataset.net/>`_
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/synthia_dataset>`_
* `Format documentation <synthia.md>`_
* Tabular (``classification``, ``regression``) (import/export only)
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/tabular_dataset/adopt-a-buddy>`_
* `Format documentation <tabular.md>`_
* TF Detection API (``bboxes``, ``masks``)
* Format specifications: `[bboxes] <https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md>`_, `[masks] <https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/instance_segmentation.md>`_
* `Dataset example <https://github.com/openvinotoolkit/datumaro/tree/develop/tests/assets/tf_detection_api_dataset>`_
122 changes: 122 additions & 0 deletions docs/source/docs/data-formats/formats/tabular.md
@@ -0,0 +1,122 @@
# Tabular

## Format specification

A tabular dataset generally refers to data organized as a table with multiple rows and columns. <br>
`.csv` files are the most common format, and OpenML uses `.arff` as its official format.

Datumaro supports tabular data only in `.csv` format, where the first row must be a header with unique column names.
This is because other formats can easily be converted to `.csv`, as shown below.

```python
# convert '.arff' to '.csv'
from scipy.io.arff import loadarff
import pandas as pd
data = loadarff("dataset.arff")
df = pd.DataFrame(data[0])
# decode byte-string (nominal) columns to regular strings
categorical = [col for col in df.columns if df[col].dtype == "O"]
df[categorical] = df[categorical].apply(lambda x: x.str.decode("utf-8"))
df.to_csv("arff.csv", index=False)

# convert '.parquet', '.feather', '.hdf5', '.pickle' to '.csv'.
pd.read_parquet("dataset.parquet").to_csv('parquet.csv', index=False)
pd.read_feather("dataset.feather").to_csv('feather.csv', index=False)
pd.read_hdf("dataset.hdf5").to_csv('hdf5.csv', index=False)
pd.read_pickle("dataset.pickle").to_csv('pickle.csv', index=False)

# convert '.jay' to '.csv'
import datatable as dt
data = dt.fread("dataset.jay")
data.to_csv("jay.csv")
```
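Before importing, it can be worth checking that a file meets the header requirement above (first row is a header with unique column names). A minimal sketch using only the standard library; the helper name is illustrative, not part of Datumaro:

```python
import csv
import io

def has_valid_header(csv_text: str) -> bool:
    """Check that the first row is a header with unique, non-empty column names."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return bool(header) and all(header) and len(header) == len(set(header))

print(has_valid_header("date,day,class\n0.42,5,UP\n"))  # True
print(has_valid_header("date,day,date\n0.42,5,UP\n"))   # False: duplicate column name
```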

A tabular dataset can be one of the following:
- a single file with a `.csv` extension
- a directory containing `.csv` files (only one directory level is supported).
<!--lint disable fenced-code-flag-->
```
dataset/
├── aaa.csv
├── ...
└── zzz.csv
```
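The one-level layout rule above can be mirrored with a small sketch for gathering candidate files; this is an illustration of the rule, not the importer's actual code:

```python
import glob
import os.path as osp

def collect_csv_files(path: str) -> list:
    """Gather .csv files for a tabular dataset: a single file, or one directory level."""
    if osp.isfile(path):
        return [path] if path.endswith(".csv") else []
    # only the top level of the directory is scanned (1 depth, no recursion)
    return sorted(glob.glob(osp.join(path, "*.csv")))
```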

Supported annotation types:
- `Tabular`

## Import tabular dataset

A Datumaro project with a tabular source can be created in the following way:

```bash
datum project create
datum project import --format tabular <path/to/dataset>
```

It is also possible to import the dataset using Python API:

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/dataset>', 'tabular')
```

Datumaro stores the imported table as media (a list of `TableRow`) and annotates the target columns.
By default, the last column is regarded as the target column;
the user can also specify the target column(s) explicitly when importing the dataset, as shown below.

```bash
datum project create
datum project import --format tabular <path/to/buddy/dataset> -- --target breed_category,pet_category
datum project import --format tabular <path/to/electricity/dataset> -- --target class
```

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/buddy/dataset>', 'tabular', target=["breed_category", "pet_category"])
dataset = dm.Dataset.import_from('<path/to/electricity/dataset>', 'tabular', target="class")
```
**Contributor:**

How about showing an example of how a table is displayed, like `df.head()` in pandas?

|   date   | day | period   | nswprice | nswdemand | vicprice | vicdemand | transfer | class |
|----------|-----|----------|----------|-----------|----------|-----------|----------|-------|
| 0.425556 |  5  | 0.340426 | 0.076108 |  0.392889 | 0.003467 |  0.422915 | 0.414912 |   UP  |

If this is not appropriate, please ignore it.

**Author:**

That's a good idea; I added examples at the end.


As shown, the target can be a single column name or a comma-separated list of columns.
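The default rule (last column as target) can be illustrated with plain pandas; this is a sketch of the rule itself, not the importer's internals:

```python
import pandas as pd

df = pd.DataFrame(
    {"nswprice": [0.076, 0.060], "nswdemand": [0.393, 0.483], "class": ["UP", "DOWN"]}
)
# When no target is given, the last column is assumed to be the target.
targets = [df.columns[-1]]
features = [c for c in df.columns if c not in targets]
print(targets, features)  # ['class'] ['nswprice', 'nswdemand']
```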

Note that each tabular file is treated as a separate subset.
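A sketch of how file names could map to subset names; naming the subset after the file stem is an assumption for illustration, not a documented guarantee:

```python
from pathlib import Path

def subset_name(csv_path: str) -> str:
    # assumed: the subset is named after the file, e.g. 'train.csv' -> 'train'
    return Path(csv_path).stem

print(subset_name("dataset/train.csv"))  # train
```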

## Export tabular dataset

Datumaro supports exporting a tabular dataset using the CLI or the Python API.
Each subset will be saved to a separate `.csv` file.

```bash
datum project create
datum project import -f tabular <path/to/dataset>
datum project export -f tabular -o <output/dir>
```

```python
import datumaro as dm
dataset = dm.Dataset.import_from('<path/to/dataset>', 'tabular')
dataset.export('<path/to/output/dir>', 'tabular')
```
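Since each subset is written as a plain `.csv`, the export side can be sanity-checked with pandas alone; a sketch under the assumption that the header and values survive a write/read round trip unchanged:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2], "class": ["UP", "DOWN"]})
out = os.path.join(tempfile.mkdtemp(), "train.csv")
df.to_csv(out, index=False)  # one .csv per subset, header row included
restored = pd.read_csv(out)
print(restored.equals(df))  # True
```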

Note that converting a tabular dataset into other formats and vice versa is not supported.

## Examples
Examples of using this format from the code can be found in
[the format tests](https://github.com/openvinotoolkit/datumaro/blob/develop/tests/unit/test_tabular_format.py).
The datasets used there are randomly sampled; the full datasets can be found below.
- Electricity Dataset: <https://www.openml.org/d/44156>

| date | day | period | nswprice | nswdemand | vicprice | vicdemand | transfer | class |
|---------:|------:|---------:|-----------:|------------:|-----------:|------------:|-----------:|:--------|
| 0.425556 | 5 | 0.340426 | 0.076108 | 0.392889 | 0.003467 | 0.422915 | 0.414912 | UP |
| 0.425512 | 4 | 0.617021 | 0.060376 | 0.483041 | 0.003467 | 0.422915 | 0.414912 | DOWN |
| 0.013982 | 4 | 0.042553 | 0.061967 | 0.521125 | 0.003467 | 0.422915 | 0.414912 | DOWN |
| 0.907349 | 3 | 0.06383 | 0.080581 | 0.331003 | 0.00538 | 0.47566 | 0.441228 | DOWN |
| 0.889341 | 0 | 0.361702 | 0.027141 | 0.379649 | 0.001624 | 0.248317 | 0.69386 | DOWN |

- Buddy Dataset: <https://www.kaggle.com/datasets/akash14/adopt-a-buddy>

| pet_id | issue_date | listing_date | condition | color_type | length(m) | height(cm) | X1 | X2 | breed_category | pet_category |
|:-----------|:--------------------|:--------------------|------------:|:-------------|------------:|-------------:|-----:|-----:|-----------------:|---------------:|
| ANSL_59957 | 2015-10-21 00:00:00 | 2016-11-12 09:00:00 | nan | Lynx Point | 0.49 | 24.53 | 16 | 9 | 2 | 1 |
| ANSL_57687 | 2016-08-25 00:00:00 | 2016-09-20 08:11:00 | nan | Red | 0.87 | 43.17 | 15 | 4 | 2 | 4 |
| ANSL_62277 | 2014-12-29 00:00:00 | 2017-01-19 14:47:00 | nan | Brown | 0.81 | 25.72 | 15 | 4 | 2 | 4 |
| ANSL_72624 | 2016-10-30 00:00:00 | 2017-02-18 14:57:00 | 1 | Brown Tabby | 0.36 | 10.18 | 0 | 1 | 0 | 1 |
| ANSL_51838 | 2014-12-29 00:00:00 | 2017-01-19 14:46:00 | nan | Brown | 0.97 | 48.7 | 15 | 4 | 2 | 4 |
16 changes: 16 additions & 0 deletions docs/source/docs/data-formats/media_formats.md
@@ -4,6 +4,7 @@ Datumaro supports the following media types:
- 2D RGB(A) images
- Videos
- KITTI Point Clouds
- Tabular file (csv format)

To create an unlabelled dataset from an arbitrary directory with images use
`image_dir` and `image_zip` formats:
@@ -75,3 +76,18 @@ Datumaro supports the following video formats:
.mp4, .mpg, .mpeg, .m2p, .ps, .ts, .m2ts, .mxf, .ogg, .ogv, .ogx,
.mov, .qt, .rmvb, .vob, .webm
```

Datumaro also supports a tabular format.
A tabular dataset can be a single `.csv` file or a folder containing `.csv` files.

```bash
cd </path/to/project>
datum project create
datum project import -f tabular </path/to/tabular>
```

```python
from datumaro import Dataset

dataset = Dataset.import_from('/path/to/tabular', 'tabular')
```
20 changes: 20 additions & 0 deletions src/datumaro/components/annotation.py
@@ -1115,6 +1115,17 @@ def add(
dtype: Type[TableDtype],
labels: Optional[Set[str]] = None,
) -> int:
"""
Add a Tabular Category.

Args:
name (str): Column name
dtype (type): Type of the corresponding column. (str, int, or float)
labels (optional, set(str)): Label values that the column can have.

Returns:
int: The index of the added category.
"""
assert name
assert name not in self._indices_by_name
assert dtype
@@ -1126,6 +1137,15 @@ def add(
return index

def find(self, name: str) -> Tuple[Optional[int], Optional[Category]]:
"""
Find Category information for the given column name.

Args:
name (str): Column name

Returns:
tuple(int, Category): The index and the Category, or (None, None) if the name is not found.
"""
index = self._indices_by_name.get(name)
return index, self.items[index] if index is not None else None

80 changes: 65 additions & 15 deletions src/datumaro/components/media.py
@@ -1206,15 +1206,13 @@ def as_dict(self) -> Dict[str, Any]:


class Table:
"""
Provides random access to the table row.
"""

def __init__(
self,
) -> None:
"""
Constructor for Table media.
Table data with multiple rows and columns.
This provides random access to the table rows.

Initialization must be done in the child class.
"""
assert self.__class__ != Table, (
@@ -1226,7 +1224,12 @@ def __init__(

@classmethod
def from_csv(cls, path: str, *args, **kwargs) -> Type[Table]:
"""Returns Table instance creating from a csv file."""
"""
Returns a Table instance created from a csv file.

Args:
path (str) : Path to csv file.
"""
return TableFromCSV(path, *args, **kwargs)

@classmethod
@@ -1236,7 +1239,12 @@ def from_dataframe(
*args,
**kwargs,
) -> Type[Table]:
"""Returns Table instance creating from a pandas DataFrame."""
"""
Returns a Table instance created from a pandas DataFrame.

Args:
data (DataFrame) : Data in pandas DataFrame format.
"""
return TableFromDataFrame(data, *args, **kwargs)

@classmethod
@@ -1246,7 +1254,12 @@ def from_list(
*args,
**kwargs,
) -> Type[Table]:
"""Returns Table instance creating from a list of dicts."""
"""
Returns a Table instance created from a list of dicts.

Args:
data (list(dict(str,str|int|float))) : A list of table row data.
"""
return TableFromListOfDict(data, *args, **kwargs)

def __eq__(self, other: object) -> bool:
@@ -1288,9 +1301,7 @@ def dtype(self, column: str) -> Optional[Type[TableDtype]]:
return type(np.zeros(1, numpy_type).tolist()[0])

def features(self, column: str, unique: Optional[bool] = False) -> List[TableDtype]:
"""
Get features for a given column name.
"""
"""Get features for a given column name."""
if unique:
return list(self.data[column].unique())
else:
@@ -1300,6 +1311,12 @@ def save(
self,
path: str,
):
"""
Save table instance to a '.csv' file.

Args:
path (str) : Path to the output csv file.
"""
data: pd.DataFrame = self.data
os.makedirs(osp.dirname(path), exist_ok=True)
data.to_csv(path, index=False)
@@ -1316,10 +1333,13 @@ def __init__(
**kwargs,
) -> None:
"""
Constructor for TableFromCSV.
@param path: Path to csv file
@param sep: Delimiter to use.
@param encoding: Encoding to use for UTF when reading/writing (ex. 'utf-8').
Read a '.csv' file and compose a Table instance.

Args:
path (str) : Path to csv file.
dtype (optional, dict(str,str)) : Dictionary mapping column name -> type string ('str', 'int', or 'float').
sep (optional, str) : Delimiter to use.
encoding (optional, str) : Encoding to use for UTF when reading/writing (ex. 'utf-8').
"""
super().__init__(path, *args, **kwargs)

@@ -1348,6 +1368,12 @@ def __init__(
*args,
**kwargs,
):
"""
Read a pandas DataFrame and compose a Table instance.

Args:
data (DataFrame) : Data in pandas DataFrame format.
"""
super().__init__(data=data, *args, **kwargs)

if data is None:
@@ -1373,13 +1399,27 @@ def __init__(
*args,
**kwargs,
):
"""
Read a list of table row data and compose a Table instance.
The table row data is in dictionary format.

Args:
data (list(dict(str,str|int|float))) : A list of table row data.
"""
super().__init__(data=pd.DataFrame(data), *args, **kwargs)


class TableRow(MediaElement):
_type = MediaType.TABLE_ROW

def __init__(self, table: Table, index: int):
"""
TableRow media refers to a Table instance and its row index.

Args:
table (Table) : Table instance.
index (int) : Row index.
"""
if table is None:
raise ValueError("'table' can't be None")
if index < 0 or index >= table.shape[0]:
@@ -1389,13 +1429,23 @@ def __init__(self, table: Table, index: int):

@property
def table(self) -> Table:
"""Table instance"""
return self._table

@property
def index(self) -> int:
"""Row index"""
return self._index

def data(self, targets: Optional[List[str]] = None) -> Dict:
"""
Row data in dict format.

Args:
targets (optional, list(str)) : If specified, only the values
    of the target columns are returned.
    Otherwise, the whole row data is returned.
"""
row = self.table.data.iloc[self.index]
if targets:
row = row[targets]