Kate/splitter cli#81

Merged
zhiltsov-max merged 7 commits into develop from kate/splitter-cli
Jan 14, 2021

Conversation

@jihyeonyi

@jihyeonyi jihyeonyi commented Jan 12, 2021

Summary

This PR:

  • adds CLI support for the task-specific splits
  • revises the re-identification split
  • updates the documentation for the task-specific splits

How to test

Unit tests

$ python -m unittest -v tests/test_splitter.py

Testing classification split with imagenet dataset.

Note: ImageNet doesn't support subsets, but checking the subsets at the project level is enough here.

$ pip install .
$ datum project create -o imagenet
$ datum source add path <path-to-source> -f imagenet -p imagenet/
$ datum project transform -t classification_split -p imagenet/ -- --subset train:.5 --subset val:.2 --subset test:.3
$ datum project info -p imagenet-classification_split
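
The `--subset <name>:<ratio>` arguments used above can be illustrated with a small standalone parser. This is only a sketch of the argument shape, not Datumaro's actual implementation: the names `parse_subset` and `parse_splits` are hypothetical, and the check that the ratios sum to 1 is an assumption made for this example.

```python
import argparse
from typing import List, Sequence, Tuple


def parse_subset(spec: str) -> Tuple[str, float]:
    """Parse a '<name>:<ratio>' string such as 'train:.5' (hypothetical helper)."""
    name, sep, ratio = spec.partition(":")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected <name>:<ratio>, got {spec!r}")
    try:
        value = float(ratio)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid ratio in {spec!r}")
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"ratio must be in [0, 1], got {value}")
    return name, value


def parse_splits(argv: Sequence[str]) -> List[Tuple[str, float]]:
    """Collect repeated --subset options into a list of (name, ratio) pairs."""
    parser = argparse.ArgumentParser(prog="split-sketch")
    parser.add_argument("--subset", action="append", type=parse_subset,
                        default=[], help="subset spec in <name>:<ratio> form")
    args = parser.parse_args(argv)
    total = sum(ratio for _, ratio in args.subset)
    # Assumption for this sketch: the ratios must cover the whole dataset.
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"subset ratios must sum to 1, got {total}")
    return args.subset


if __name__ == "__main__":
    print(parse_splits(["--subset", "train:.5",
                        "--subset", "val:.2",
                        "--subset", "test:.3"]))
```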

Testing detection split with voc dataset

$ pip install .
$ datum project import -i <path-to-voc> -f voc
$ cd voc/
$ datum project transform -t detection_split -- --subset train:.5 --subset val:.2 --subset test:.3
$ datum project info -p voc-detection_split

Testing re-identification split with imagenet dataset.

Note: Datumaro doesn't support re-id datasets yet, so a classification dataset is used instead.

$ pip install .
$ datum project create -o imagenet
$ datum source add path <path-to-imagenet> -f imagenet -p imagenet/
$ datum project transform -t reidentification_split -p imagenet/ -- --subset train:.5 --subset val:.2 --subset test:.3 --query .5
$ datum project info -p imagenet-reidentification_split

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

zhiltsov-max
zhiltsov-max previously approved these changes Jan 13, 2021
Contributor

@zhiltsov-max zhiltsov-max left a comment

Please check the updated class descriptions for correctness.

Future updates could include:

  • ignoring attributes in classification split (for captions, descriptions and other technical attributes)
  • splitting using an attribute as label in classification split
  • using polygons and masks in detection split

Comment on lines +232 to +233
Produces a split with a specified ratio of images, avoiding having same
labels in different subsets.|n
Author

Here, what we avoid sharing across subsets is the person ID or object ID. The ID is the label, or an attribute if attr_for_id is specified.

Author

@jihyeonyi jihyeonyi Jan 14, 2021

One more thing: the train and val sets actually do share person IDs or object IDs (most person re-identification datasets don't have a val set, though). They just don't share IDs with the test set.
I'm not sure how precise the description needs to be.
If you feel the current explanation is sufficient, please leave it as it is.
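
The ID constraint described in this thread can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in this PR: `reid_split`, its parameters, and the subset names `test-query` and `test-gallery` are hypothetical. In particular, `val_ratio` is taken here as a fraction of each train/val identity's items and `query_ratio` as a fraction of each test identity's items.

```python
import random
from collections import defaultdict


def reid_split(items, test_ratio, val_ratio, query_ratio, seed=0):
    """Split (item_id, identity) pairs so that test identities are disjoint
    from train/val identities, while train and val may share identities.

    Hypothetical sketch; ratios are interpreted as described in the lead-in.
    """
    rng = random.Random(seed)

    # Group item ids by identity (person ID or object ID).
    by_identity = defaultdict(list)
    for item_id, identity in items:
        by_identity[identity].append(item_id)

    # Assign whole identities to test, so no ID leaks into train/val.
    identities = sorted(by_identity)
    rng.shuffle(identities)
    n_test = round(len(identities) * test_ratio)
    test_identities = identities[:n_test]
    trainval_identities = identities[n_test:]

    result = {"train": [], "val": [], "test-query": [], "test-gallery": []}

    # Train and val share identities: split each identity's items between them.
    for identity in trainval_identities:
        members = list(by_identity[identity])
        rng.shuffle(members)
        n_val = round(len(members) * val_ratio)
        result["val"] += members[:n_val]
        result["train"] += members[n_val:]

    # Each test identity contributes items to both query and gallery.
    for identity in test_identities:
        members = list(by_identity[identity])
        rng.shuffle(members)
        n_query = round(len(members) * query_ratio)
        result["test-query"] += members[:n_query]
        result["test-gallery"] += members[n_query:]

    return result
```

Because identities are assigned to the test side as a whole, the disjointness holds by construction; only the item-level ratios are approximate.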

@jihyeonyi
Author

Please check the updated class descriptions for correctness.

Future updates could include:

  • ignoring attributes in classification split (for captions, descriptions and other technical attributes)
  • splitting using an attribute as label in classification split
  • using polygons and masks in detection split

Thank you for revising the descriptions.
Regarding the future updates:

  1. Would you like to remove the attribute-based splitting or just make it optional?
    I think the latter is better.
  2. When you say 'splitting using an attribute as label', do you mean splitting using only attributes, regardless of labels?
  3. Does the detection task have polygons or masks? I thought they were for the segmentation task, but maybe I'm wrong.
    For your information, I'll add a splitter for the segmentation task, so how about adding polygons and masks later?

@zhiltsov-max
Contributor

  1. Would you like to remove the attribute-based splitting or just make it optional?

Optional, enabled by default.

  2. When you say 'splitting using an attribute as label', do you mean splitting using only attributes, regardless of labels?

I mean using a single attribute, as in re-id. Possibly also using a subset of the attributes, or ignoring some of them.

  3. Does the detection task have polygons or masks?

In Mask R-CNN, they are intermixed with the segmentation task. Personally, I consider these annotation types more or less interchangeable, because any of them can be used to train both segmentation and detection algorithms.
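
The "single attribute as label" idea from the second answer could look roughly like the sketch below: a stratified split where the grouping key is pluggable, so it can be the label or one chosen attribute. This is a hypothetical illustration, not a planned Datumaro API; `stratified_split`, the item dictionaries, and the `color` attribute are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios, key, seed=0):
    """Split items into subsets so each group (as defined by `key`) is
    divided according to `ratios`, a list of (subset_name, ratio) pairs.

    Hypothetical sketch: the last subset receives any rounding remainder.
    """
    rng = random.Random(seed)

    # Group items by the chosen key (label, or a single attribute value).
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    result = {name: [] for name, _ in ratios}
    for group_key in sorted(groups, key=repr):
        members = groups[group_key]
        rng.shuffle(members)
        start, cumulative = 0, 0.0
        for index, (name, ratio) in enumerate(ratios):
            cumulative += ratio
            if index == len(ratios) - 1:
                end = len(members)
            else:
                end = round(len(members) * cumulative)
            result[name] += members[start:end]
            start = end
    return result


# Example items: label alternates, the "color" attribute changes halfway.
items = [
    {"id": i,
     "label": "cat" if i % 2 == 0 else "dog",
     "attributes": {"color": "red" if i < 10 else "blue"}}
    for i in range(20)
]
ratios = [("train", 0.5), ("val", 0.2), ("test", 0.3)]

by_label = stratified_split(items, ratios, key=lambda it: it["label"])
by_color = stratified_split(items, ratios, key=lambda it: it["attributes"]["color"])
```

Keeping the key as a callable means ignoring or combining attributes would only require a different key function, which fits the "optional, enabled by default" behavior mentioned in the first answer.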

@zhiltsov-max zhiltsov-max merged commit 1ee908f into develop Jan 14, 2021
@zhiltsov-max zhiltsov-max deleted the kate/splitter-cli branch February 16, 2021 10:55
zhiltsov-max added a commit to zhiltsov-max/datumaro that referenced this pull request Jan 27, 2026
* syncing util/mask_tools.py

* syncing util/image.py

* keeping exif unconditionally

* syncing components/media.py

* syncing components/importer.py

* syncing util/meta_file_util.py

* moving cli/contexts/project/diff.py to cli/util/compare.py

* moving Registry and PluginRegistry to components/registry.py

* syncing components/exporter.py

* syncing components/hl_ops.py

* syncing components/dataset.py

* limiting opencv version (due to opencv/opencv#25809)

* fixes

* upper case extension fix

* fixes

* always keeping exif info

* limiting opencv version

* Update src/datumaro/components/media.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* test for reading exif orientation

* changelog entry

* fixed isort

* fixed test

* fixed changelog

* Update src/datumaro/components/hl_ops/__init__.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* fixing filter examples

* hl_ops tests

* syncing plugins/data_formats/celeba

* syncing plugins/data_formats/cifar.py

* setting DETECT_CONFIDENCE for yolo formats

* syncing plugins/data_formats/image_dir.py

* better detection for yolo classification importer

* syncing plugins/data_formats/imagenet.py and plugins/data_formats/imagenet_txt.py

* syncing plugins/data_formats/camvid.py

* syncing tests/integration/cli/test_detect_format.py

* syncing cli/util/project.py

* syncing tests/integration/cli/test_filter.py

* syncing tests/integration/cli/test_transform.py

* yolo streaming exporter

* syncing plugins/data_formats/coco

* Update src/datumaro/components/media.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* Update src/datumaro/components/media.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* Update src/datumaro/components/media.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* Update tests/unit/test_video.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* Update src/datumaro/components/registry.py

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* coco find_images_dir do not fail if images folder does not exist - because cvat needs to be able to export and then import dataset without images

* coco find_rootpath do not fail if path does not end with ANNOTATIONS_DIR - because cvat needs it

* fixes

* fix linters

* tests for HLOps.compare

* syncing tests/unit/test_image.py

* accounting for the new flag in cv2

* syncing components/importer.py

* fixes

* fixes

* fixes

* fixes

* tests in test_masks.py from upstream

* a bit of info on ImageColorChannel.UNCHANGED

* fixing wrong merge

* removing bad changes

* rolling back changes in test

* do not recollect subset names in StreamDatasetStorage if transformations do not change subsets

* fixes

* Refactor with_subset_dirs

* Support detect() calls with no return value

* Update importer detection confidence

* Lower the default confidence

* Align default format detection confidence in detector and importer

* Clean imports

* syncing tests/conftest.py and tests/unit/data_formats/conftest.py

* syncing imagenet tests

* test new yolo classification detection behaviour

* syncing tests/unit/test_format_detection.py

* Apply suggestions from code review

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* fixes

* fixes

* Apply suggestions from code review

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

* small fixes

* basic streaming tests for coco and yolo formats

* returning previous tests and behaviour for coco

* Improve function name

* raising error on unknown image id

* test coco streaming

* test yolo streaming

---------

Co-authored-by: Maxim Zhiltsov <zhiltsov.max35@gmail.com>

2 participants