Add support for importing legacy datumaro datasets with videos by AlbertvanHouten · Pull Request #2093 · open-edge-platform/datumaro

AlbertvanHouten · 2026-04-01T08:01:35Z

Checklist

I have added tests to cover my changes or documented any manual tests.
I have updated the documentation accordingly

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

codecov-commenter · 2026-04-01T08:05:30Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.75325% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...tumaro/experimental/converters/media_converters.py	93.75%	2 Missing and 1 partial ⚠️
...c/datumaro/experimental/legacy/media_converters.py	97.97%	1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR adds support for importing legacy Datumaro datasets containing videos (via VideoFrame) into the experimental dataset format, including mixed image+video datasets, and extends the media converter bridge to support unified media-path workflows.

Changes:

Added forward/backward legacy media converters for VideoFrame and mixed (image + video frame) datasets, and updated forward-converter selection strategy accordingly.
Introduced converter utilities to promote VideoFramePathField → MediaPathField and to extract MediaInfoField from MediaPathField.
Added extensive unit tests covering converter behavior, dataset analysis, and round-trip conversions for video/mixed datasets.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/unit/experimental/legacy/test_legacy_media_converters.py	Adds unit tests for new legacy video + mixed media converters.
tests/unit/experimental/legacy/test_legacy_dataset_converters.py	Adds tests for forward media converter strategy, analysis, convert_from/to_legacy, and round-trips for video/mixed datasets.
tests/unit/experimental/converters/test_media_bridge_converters.py	Adds tests for new media bridge converters (`VideoFramePath`→`MediaPath`, `MediaPath`→`MediaInfo`).
src/datumaro/experimental/legacy/register_legacy_converters.py	Changes forward media converter selection strategy; registers new backward converters.
src/datumaro/experimental/legacy/media_converters.py	Implements forward/backward converters for `VideoFrame` and mixed datasets.
src/datumaro/experimental/legacy/init.py	Exports newly added legacy media converters.
src/datumaro/experimental/converters/media_converters.py	Adds `VideoFramePathToMediaPathConverter` and `MediaPathToMediaInfoConverter`.
src/datumaro/experimental/converters/init.py	Exposes newly added experimental converters.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-01T08:15:30Z

src/datumaro/experimental/legacy/register_legacy_converters.py

+    Strategy:
+    - If the dataset contains *any* video frames (mixed or video-only),
+      the :class:`ForwardMixedMediaConverter` is used with a unified
+      ``media_path_field``.
+    - If the dataset contains *only* images (no video frames at all),
+      the :class:`ForwardImageMediaConverter` is used with an
+      ``image_path_field``.
+
    Args:
        dataset: Legacy dataset to create converter from
        semantic: The semantic type for the converted fields
        name_prefix: Prefix to prepend to all field names
    """
-    # Get the dataset's media type
-    media_type = cast("type[MediaElement[Any]]", dataset.media_type())
-
-    # Try converter registered for this specific media type
-    if media_type in _media_converter_classes:
-        converter_class = _media_converter_classes[media_type]
-        return converter_class.create(dataset, semantic, name_prefix)
-
-    return None
+    # Try mixed-media converter first — it activates when any VideoFrame
+    # items are present (mixed or video-only datasets).
+    mixed = ForwardMixedMediaConverter.create(dataset, semantic, name_prefix)
+    if mixed is not None:
+        return mixed
+
+    # No video frames found — try the image-only converter.
+    return ForwardImageMediaConverter.create(dataset, semantic, name_prefix)


This change makes get_forward_media_converter ignore the forward media-converter registry entirely, which is a behavioral/API regression for legacy datasets whose media type is neither image-only nor (mixed) video frames (e.g., any custom or future registered media converter). Consider preserving the registry-based lookup as a final fallback: try mixed-first, then image-only, then fall back to the registered converter by dataset.media_type() (or an iteration of registered converters) so existing extensibility continues to work.

src/datumaro/experimental/legacy/media_converters.py

Copilot · 2026-04-01T08:15:30Z

src/datumaro/experimental/legacy/media_converters.py

+        elif isinstance(item.media, Image) and isinstance(item.media, FromFileMixin):
+            result[key] = LazyImage(path=item.media.path)


In mixed datasets, images that are not FromFileMixin (e.g., ImageFromData / byte-backed images) will silently produce no media value, which can lead to null/missing media for those samples after conversion. Since this converter is selected for any dataset containing at least one VideoFrame, this can break mixed datasets that include non-file images. Consider either (a) supporting the same image variants as ForwardImageMediaConverter (bytes/callables/info), or (b) raising a clear exception when encountering unsupported image media in mixed mode so the failure is explicit.

Suggested change

elif isinstance(item.media, Image) and isinstance(item.media, FromFileMixin):

result[key] = LazyImage(path=item.media.path)

elif isinstance(item.media, Image):

if isinstance(item.media, FromFileMixin):

result[key] = LazyImage(path=item.media.path)

else:

raise TypeError(

"ForwardMixedMediaConverter does not support non-file-based "

"images (e.g., byte/numpy-backed Image instances) in mixed "

"datasets. Use a converter that supports in-memory images "

"or ensure all images are file-based when mixing with video frames."

)

Copilot · 2026-04-01T08:15:31Z

src/datumaro/experimental/legacy/media_converters.py

+    def convert_to_legacy_media(self, sample: Sample) -> MediaElement[Any]:
+        """Convert a LazyImage or LazyVideoFrame back to the appropriate legacy media.
+
+        Returns Image for LazyImage values and VideoFrame for LazyVideoFrame values.
+        """
+        value = getattr(sample, self.media_attr)
+
+        if isinstance(value, LazyVideoFrame):
+            video_path = str(value.video_path)
+            if video_path not in self._video_cache:
+                self._video_cache[video_path] = Video(path=video_path)
+            video = self._video_cache[video_path]
+            return VideoFrame(video=video, index=value.frame_index)
+
+        if isinstance(value, LazyImage):
+            return Image.from_file(path=str(value.path))  # pyright: ignore[reportUnknownMemberType]
+
+        raise TypeError(f"Expected LazyImage or LazyVideoFrame, got {type(value)}")


BackwardMixedMediaConverter will raise a TypeError when the unified media value is None (or otherwise missing), but legacy DatasetItem media can be None (and the forward converters already treat media=None as a valid case by producing {}). To avoid breaking round-trips or conversions with missing media, handle None explicitly (e.g., return None so the legacy DatasetItem ends up with no media) and reserve TypeError for truly unexpected value types.

Copilot · 2026-04-01T08:15:31Z

src/datumaro/experimental/converters/media_converters.py

+            if frame_idx is not None:
+                # Video frame — extract video metadata
+                video_info = extract_video_info(path_str)
+                rows.append(
+                    {
+                        "width": video_info.width,
+                        "height": video_info.height,
+                        "fps": float(video_info.fps) if video_info.fps else None,
+                        "total_frames": video_info.total_frames,
+                        "duration": float(video_info.duration) if video_info.duration else None,
+                        "codec": video_info.codec,
+                        "frame_index": int(frame_idx),
+                    }
+                )
+            else:
+                # Standalone image — read dimensions only
+                p = _Path(path_str)
+                if not p.exists():
+                    rows.append(None)
+                    continue


For the video-frame path branch, the converter doesn’t check file existence and doesn’t handle failures from extract_video_info(path_str) (e.g., missing/unreadable video). This can raise exceptions during conversion, whereas the image branch explicitly returns None for missing files. Consider aligning behavior by checking existence for video paths too and/or wrapping extract_video_info in error handling that returns None when metadata cannot be extracted.

Copilot · 2026-04-01T08:15:31Z

src/datumaro/experimental/converters/media_converters.py

+        rows: list[dict[str, Any] | None] = []
+
+        for i in range(len(df)):
+            path = df[input_col][i]
+            frame_idx = df[frame_idx_col][i] if frame_idx_col in df.columns else None


This converter iterates row-by-row in Python and performs per-row file I/O (PIL open / video metadata extraction), which will be a bottleneck on large datasets. If this converter is expected to run on sizeable framesets, consider using a Polars-native approach (pl.map_elements / pl.struct(...).map_elements) to reduce Python overhead, and add caching for repeated paths (common for video frames) so the same video metadata isn’t recomputed for every frame.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Add support for importing legacy datumaro datasets with videos

c63901a

Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

AlbertvanHouten requested a review from Copilot April 1, 2026 08:07

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Update src/datumaro/experimental/legacy/media_converters.py

8fd3ebe

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Albert van Houten <albert.van.houten@intel.com>

Copilot started reviewing on behalf of AlbertvanHouten April 1, 2026 08:56 View session

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for importing legacy datumaro datasets with videos#2093

Add support for importing legacy datumaro datasets with videos#2093
AlbertvanHouten wants to merge 2 commits intodevelopfrom
albert/import-legacy-videos

AlbertvanHouten commented Apr 1, 2026

Uh oh!

codecov-commenter commented Apr 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		elif isinstance(item.media, Image) and isinstance(item.media, FromFileMixin):
		result[key] = LazyImage(path=item.media.path)

-        elif isinstance(item.media, Image) and isinstance(item.media, FromFileMixin):
-            result[key] = LazyImage(path=item.media.path)
+        elif isinstance(item.media, Image):
+            if isinstance(item.media, FromFileMixin):
+                result[key] = LazyImage(path=item.media.path)
+            else:
+                raise TypeError(
+                    "ForwardMixedMediaConverter does not support non-file-based "
+                    "images (e.g., byte/numpy-backed Image instances) in mixed "
+                    "datasets. Use a converter that supports in-memory images "
+                    "or ensure all images are file-based when mixing with video frames."
+                )

Conversation

AlbertvanHouten commented Apr 1, 2026

Checklist

Uh oh!

codecov-commenter commented Apr 1, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants