Dataset splits do not have exactly the requested weights #292

@ageron

Short description
When I split the tf_flowers dataset into subsplits with weights 10, 15 and 75, I actually get datasets of size 400, 600, and 2670. That is 10.9%, 16.3%, and 72.8% of the 3,670 examples, which is noticeably different from the 10%, 15%, and 75% I requested.
Moreover, short of iterating through each dataset in full, there does not seem to be a way to find out the size of the splits.
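
For reference, the percentages above can be reproduced from the observed counts with a few lines of plain Python. This is only a sketch of the arithmetic; `observed_percentages` is a hypothetical helper name and is not part of TFDS.

```python
def observed_percentages(sizes):
    """Return each size as a percentage of the total, rounded to 1 decimal."""
    total = sum(sizes)
    return [round(100 * size / total, 1) for size in sizes]


# Sizes observed for the three tf_flowers subsplits.
print(observed_percentages([400, 600, 2670]))  # [10.9, 16.3, 72.8]
```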

Environment information

  • Operating System: MacOSX 10.13.6
  • Python version: 3.6.8
  • tfds-nightly version: tfds-nightly-1.0.1.dev201903180105
  • tf-nightly-2.0-preview version: tf-nightly-2.0-preview-2.0.0.dev20190319

Reproduction instructions

import tensorflow_datasets as tfds

test_split, valid_split, train_split = tfds.Split.TRAIN.subsplit([10, 15, 75])

test_set = tfds.load("tf_flowers", split=test_split, as_supervised=True)
valid_set = tfds.load("tf_flowers", split=valid_split, as_supervised=True)
train_set = tfds.load("tf_flowers", split=train_split, as_supervised=True)

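def dataset_length(dataset):
    # Count elements by iterating over the whole dataset.
    # With as_supervised=True, each element is an (image, label) pair.
    count = 0
    for _ in dataset:
        count += 1
    return count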

print(dataset_length(test_set)) # 400
print(dataset_length(valid_set)) # 600
print(dataset_length(train_set)) # 2670

Expected behavior
I expected split sizes that match the requested ratios (rounded up or down to the nearest integer): for the 3,670 examples in this dataset, the correct sizes should have been 367, 550 and 2753 (or 367, 551 and 2752).
I also expect to be able to know the subsplit sizes without iterating through the datasets.
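
To make the expected numbers concrete, here is a minimal sketch of one way to turn weights into integer split sizes that always sum to the dataset size. It uses a largest-remainder scheme, which is my assumption about a reasonable rounding rule, not a description of how TFDS computes subsplits; `expected_split_sizes` is a hypothetical helper name.

```python
def expected_split_sizes(total, weights):
    """Split `total` examples proportionally to `weights`.

    Assumption: largest-remainder rounding. Each split first gets the floor
    of its exact share, then the leftover examples go to the splits with the
    largest remainders (ties go to the earlier split).
    """
    weight_sum = sum(weights)
    # Integer arithmetic avoids floating-point rounding surprises.
    sizes = [total * w // weight_sum for w in weights]
    remainders = [total * w % weight_sum for w in weights]
    leftover = total - sum(sizes)
    # sorted() is stable, so equal remainders keep their original order.
    by_remainder = sorted(range(len(weights)), key=lambda i: -remainders[i])
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# tf_flowers has 3670 examples (400 + 600 + 2670 from the counts above).
print(expected_split_sizes(3670, [10, 15, 75]))  # [367, 551, 2752]
```

With this rule the 15% and 75% splits both have an exact share ending in .5, so the single leftover example goes to the earlier one, giving the "551 and 2752" variant mentioned above.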

Additional context
TFDS is cool.
