Implement conversion to the experimental Dataset format. by gdlg · Pull Request #1810 · open-edge-platform/datumaro

gdlg · 2025-08-06T13:36:01Z

Summary

This PR implements the conversion from the legacy to the experimental Dataset class. I will implement the conversion back to the legacy class in a separate PR.

Misc fixes:

Also implements __len__, __delitem__ and __iter__ in the Dataset class.
Fix bug when fetching ann_types() before the cache is initialised.
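
As a rough illustration of the two items above, here is a minimal sketch of a list-backed container that supports `len()`, `del` and iteration, and that builds its annotation-type cache on first use. `ToyDataset`, its `append` method and the dict-based items are hypothetical stand-ins, not the datumaro `Dataset` class; only the names `__len__`, `__delitem__`, `__iter__` and `ann_types()` come from this PR, and the PR does not describe the actual cache bug in detail.

```python
from typing import Iterator, List, Optional, Set


class ToyDataset:
    """Hypothetical stand-in for a dataset; not the datumaro implementation."""

    def __init__(self) -> None:
        self._items: List[dict] = []
        # The cache starts out uninitialised and is built on first use.
        self._ann_types: Optional[Set[str]] = None

    def append(self, item: dict) -> None:
        self._items.append(item)
        self._ann_types = None  # contents changed, so drop the cache

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[dict]:
        return iter(self._items)

    def __delitem__(self, index: int) -> None:
        del self._items[index]
        self._ann_types = None  # contents changed, so drop the cache

    def ann_types(self) -> Set[str]:
        # Initialise the cache on demand instead of assuming it already exists.
        if self._ann_types is None:
            self._ann_types = {
                ann_type for item in self._items for ann_type in item["annotations"]
            }
        return self._ann_types
```

With this pattern, calling `ann_types()` on a freshly created, empty container returns an empty set rather than failing on a missing cache.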

The conversion works in two steps: the first step analyses the existing dataset and generates the schema for the new dataset. The second step actually converts the data.

I have defined MediaConverter and AnnotationConverter base classes, which can be extended to support new media/annotation types. This PR implements the conversion logic, but the conversion for specific media/annotation types will be implemented later.
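
The two-step flow with pluggable converters could be sketched roughly as follows. This is only an illustration under stated assumptions: `ToyAnnotationConverter`, `ToyBboxConverter`, `convert_dataset`, the method names and the dict-based items and schema are all hypothetical and do not reflect the actual interface of the MediaConverter and AnnotationConverter classes in this PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class ToyAnnotationConverter(ABC):
    """Hypothetical base class; subclass it to support a new annotation type."""

    @abstractmethod
    def can_convert(self, legacy_items: List[dict]) -> bool:
        """Analysis hook: does the legacy dataset contain this annotation type?"""

    @abstractmethod
    def schema_fields(self) -> Dict[str, type]:
        """Fields this converter contributes to the new schema."""

    @abstractmethod
    def convert(self, legacy_item: dict) -> Dict[str, Any]:
        """Convert one legacy item into values for the new fields."""


class ToyBboxConverter(ToyAnnotationConverter):
    def can_convert(self, legacy_items: List[dict]) -> bool:
        return any("bbox" in item for item in legacy_items)

    def schema_fields(self) -> Dict[str, type]:
        return {"bbox": list}

    def convert(self, legacy_item: dict) -> Dict[str, Any]:
        return {"bbox": list(legacy_item.get("bbox", []))}


def convert_dataset(
    legacy_items: List[dict], converters: List[ToyAnnotationConverter]
) -> Tuple[Dict[str, type], List[Dict[str, Any]]]:
    # Step 1: analyse the existing dataset and generate the schema.
    active = [c for c in converters if c.can_convert(legacy_items)]
    schema: Dict[str, type] = {}
    for converter in active:
        schema.update(converter.schema_fields())

    # Step 2: convert the data using only the converters selected in step 1.
    samples = []
    for item in legacy_items:
        sample: Dict[str, Any] = {}
        for converter in active:
            sample.update(converter.convert(item))
        samples.append(sample)
    return schema, samples
```

Separating the analysis pass from the conversion pass means the schema is fixed before any item is converted, so every converted sample has the same set of fields.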

Part of #1789

How to test

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have added the description of my changes into CHANGELOG.
I have updated the documentation accordingly

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2025 Intel Corporation
#
# SPDX-License-Identifier: MIT

AlbertvanHouten · 2025-08-06T14:13:30Z

tests/unit/experimental/test_dataset.py

+    # Add third sample
+    sample3 = TestSample(
+        image=np.array([[[128, 64, 192]], [[96, 160, 32]]], dtype=np.uint8),
+        bbox=np.array([[0.9, 0.8, 0.7, 0.6]], dtype=np.float32),
+        image_info=ImageInfo(width=1, height=2),
+    )
+    dataset.append(sample3)
+    assert len(dataset) == 3


Adding this third sample seems redundant after having already tested two appends. It would make more sense to remove one here and test if the len still works properly.

Done, I have also implemented __delitem__ for that.
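
A sketch of the check the reviewer suggested: append two samples, delete one, and confirm the length follows. The helper name `check_append_then_delete` is hypothetical, and a plain Python list stands in for the Dataset so the snippet runs on its own; the real test uses the TestSample and ImageInfo objects shown in the diff above.

```python
def check_append_then_delete(dataset, first, second):
    """Append two samples, remove the first, and verify len() after each step."""
    dataset.append(first)
    dataset.append(second)
    assert len(dataset) == 2

    # Relies on the container implementing __delitem__.
    del dataset[0]
    assert len(dataset) == 1

    # Relies on the container implementing __iter__.
    return list(dataset)


# A plain list stands in for the Dataset here (assumption for illustration).
remaining = check_append_then_delete(
    [],
    {"bbox": [0.1, 0.2, 0.3, 0.4]},
    {"bbox": [0.9, 0.8, 0.7, 0.6]},
)
```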

The approach is similar to #1810. The conversion works in two steps: the first step analyses the existing dataset and generates the media type, annotation type and categories for the new dataset. The second step actually converts the data.

I have defined BackwardMediaConverter and BackwardAnnotationConverter base classes, which can be extended to support new media/annotation types. This PR implements the conversion logic, but the conversion for specific media/annotation types will be implemented later.

Follow-up from #1810. Fixes #1789

### Summary

### How to test

### Checklist

- [x] I have added unit tests to cover my changes.
- [x] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into [CHANGELOG](https://github.com/open-edge-platform/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs) accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT License](https://github.com/open-edge-platform/datumaro/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example below).

```python
# Copyright (C) 2025 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

Implement conversion to the experimental Dataset format.

a3ceb1b

* Also implements __len__ and __iter__ in the Dataset class.
* Fix bug when fetching ann_types() before the cache is initialised.

gdlg requested a review from AlbertvanHouten August 6, 2025 13:36

Update changelog

1803c99

AlbertvanHouten approved these changes Aug 6, 2025

gdlg added 2 commits August 7, 2025 08:58

Fix path on Windows

b4ac963

Implement __delitem__.

9879a07

AlbertvanHouten approved these changes Aug 7, 2025

gdlg merged commit 561f0d2 into develop Aug 7, 2025
15 checks passed

gdlg mentioned this pull request Aug 7, 2025

Add conversion back from the experimental to the legacy dataset. #1811

Merged

6 tasks

gdlg deleted the gppayend/legacy-conversion branch August 18, 2025 08:31

gdlg merged 4 commits into develop from gppayend/legacy-conversion
