Inplace save & update dataset#102

Merged
zhiltsov-max merged 24 commits into develop from zm/inplace-save on Feb 13, 2021

Conversation

@zhiltsov-max
Contributor

@zhiltsov-max zhiltsov-max commented Feb 8, 2021

Summary

  • Dataset operations are finally made lazy
  • Transforms can be performed lazily for a Dataset
  • Dataset implements caching for input source. Multiple sources are immediately merged.
  • The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading
  • Added partial saving interface for datasets (for in-place dataset updates in the same format)
  • Implemented partial saving for Datumaro format
  • Extended Dataset interface with cache control, changed data info, source path and format info
  • Dataset.get() returns None instead of raising an exception when the item doesn't exist
  • Added support for the "in" operator for Dataset
  • Added get operation for Extractor
  • Added type annotations for Dataset class
  • Extended API model with new interfaces
  • Converter interface is extended by optional operation to support partial data update (patch()). The default implementation uses the regular full-dataset saving.
  • Added specific error types to be used instead of generic Exception
  • Dataset can track updates and generate patches. A transform is considered to update the whole dataset
  • Dataset.get_subset provides modifiable slices
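
To make several of the points above concrete, here is a minimal sketch of lazy loading with caching, deferred transforms, a get() that returns None for a missing item, support for the "in" operator, and update tracking. It is not the Datumaro implementation: the names Item, LazyDataset, put, load_count and updated are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple, Optional, Set, Tuple

Key = Tuple[str, str]


class Item(NamedTuple):
    id: str
    subset: str


class LazyDataset:
    """Hypothetical stand-in for illustration; not the Datumaro Dataset class."""

    def __init__(self, source: Callable[[], Iterable[Item]]) -> None:
        self._source = source                      # called only on first access
        self._pending: List[Callable[[Item], Item]] = []
        self._cache: Optional[Dict[Key, Item]] = None
        self.load_count = 0                        # how many times the source was read
        self.updated: Set[Key] = set()             # keys changed since loading

    def transform(self, fn: Callable[[Item], Item]) -> "LazyDataset":
        self._pending.append(fn)                   # nothing is computed here
        return self

    def _items(self) -> Dict[Key, Item]:
        if self._cache is None:
            self.load_count += 1
            self._cache = {(i.id, i.subset): i for i in self._source()}
        if self._pending:
            items = list(self._cache.values())
            for fn in self._pending:
                items = [fn(i) for i in items]
            self._pending = []
            # dict preserves insertion order, so element order is maintained
            self._cache = {(i.id, i.subset): i for i in items}
            # a transform is treated as an update of the whole dataset
            self.updated = set(self._cache)
        return self._cache

    def put(self, item: Item) -> None:
        self._items()[(item.id, item.subset)] = item
        self.updated.add((item.id, item.subset))

    def get(self, id: str, subset: str) -> Optional[Item]:
        # returns None instead of raising when the item does not exist
        return self._items().get((id, subset))

    def __contains__(self, item: Item) -> bool:
        return (item.id, item.subset) in self._items()

    def __iter__(self) -> Iterator[Item]:
        return iter(self._items().values())


def example_source() -> List[Item]:
    return [Item("1", "train"), Item("2", "val")]
```

In this sketch, calling transform() does not touch the source; the first get(), membership test or iteration loads the data once and applies any queued transforms. The updated set is the kind of information a partial save could use to decide what to rewrite.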

TBD:

  • update docs
  • update CVAT
  • implement partial save in formats

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@zhiltsov-max zhiltsov-max changed the title Inplace save & update dataset [WIP] Inplace save & update dataset Feb 8, 2021
@zhiltsov-max zhiltsov-max changed the title [WIP] Inplace save & update dataset Inplace save & update dataset Feb 10, 2021
@nmanovic

@zhiltsov-max, is there any difficulty in solving this: "The order of elements in a Dataset is maintained, but is not guaranteed to be the same after saving and loading"?

I'm not sure that it is critical, but I prefer deterministic behaviour if it is easy to achieve.

@zhiltsov-max
Contributor Author

@nmanovic, if a format represents a dataset with several subset files, it is impossible to reproduce initial item ordering.

Example:

Dataset:
item(1, 'train')
item(2, 'val')
item(3, 'train')

.save():

train_list.txt
val_list.txt

.load()

item(1, 'train')
item(3, 'train')
item(2, 'val')
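
The example above can be reproduced with a small sketch. It assumes a made-up format that writes one list file per subset and reads the files back one at a time; the save and load helpers here are hypothetical and do not correspond to any real converter.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ItemRef = Tuple[int, str]  # (item id, subset name)

SUFFIX = "_list.txt"


def save(items: List[ItemRef]) -> Dict[str, List[int]]:
    """Group items into one in-memory 'file' per subset (hypothetical format)."""
    files: Dict[str, List[int]] = defaultdict(list)
    for item_id, subset in items:
        files[subset + SUFFIX].append(item_id)
    return dict(files)


def load(files: Dict[str, List[int]]) -> List[ItemRef]:
    """Read the subset files back one after another."""
    items: List[ItemRef] = []
    for name in sorted(files):
        subset = name[: -len(SUFFIX)]
        items.extend((item_id, subset) for item_id in files[name])
    return items


original = [(1, "train"), (2, "val"), (3, "train")]
restored = load(save(original))
print(restored)  # [(1, 'train'), (3, 'train'), (2, 'val')]
```

The set of items survives the round trip, but the interleaving between subsets is not stored anywhere in the files, so the loader has nothing from which to recover the initial order.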

@nmanovic

@zhiltsov-max, should we update the documentation? Are you planning to add some short tutorials for the new use cases?

@zhiltsov-max
Contributor Author

@nmanovic, I'd prefer to update the documentation after the new API for operations is introduced; otherwise the changes are hard to follow. Small catchy examples were added earlier; they still work, but now they also perform well thanks to the added transparent caching. Thorough documentation will be added with r0.2 (VCS) / r0.3 (stable API), when the stable API is introduced.
