Short description
When I split the tf_flowers dataset into subsplits with weights 10, 15 and 75, I actually get datasets of sizes 400, 600, and 2670 (3670 examples in total). That corresponds to 10.9%, 16.3% and 72.8%, which is quite different from the 10%/15%/75% I requested.
Moreover, there does not seem to be any way to find out the size of each subsplit other than iterating through the whole dataset.
Environment information
- Operating System: MacOSX 10.13.6
- Python version: 3.6.8
- tfds-nightly version: tfds-nightly-1.0.1.dev201903180105
- tf-nightly-2.0-preview version: tf-nightly-2.0-preview-2.0.0.dev20190319
Reproduction instructions
import tensorflow_datasets as tfds
test_split, valid_split, train_split = tfds.Split.TRAIN.subsplit([10, 15, 75])
test_set = tfds.load("tf_flowers", split=test_split, as_supervised=True)
valid_set = tfds.load("tf_flowers", split=valid_split, as_supervised=True)
train_set = tfds.load("tf_flowers", split=train_split, as_supervised=True)
def dataset_length(dataset):
    count = 0
    for image in dataset:
        count += 1
    return count
print(dataset_length(test_set)) # 400
print(dataset_length(valid_set)) # 600
print(dataset_length(train_set)) # 2670
Expected behavior
I expected split sizes matching the requested ratios (rounded up or down to the nearest integer): in this example, the correct sizes should have been 367, 550 and 2753 (or 367, 551 and 2752).
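As a sketch of the arithmetic I had in mind (a hypothetical helper, not an existing TFDS API): taking cumulative weight boundaries over the total example count and flooring them guarantees the pieces always sum to the total.

```python
from itertools import accumulate

def expected_sizes(total, weights):
    # Hypothetical helper: divide `total` examples according to `weights`.
    # Cumulative boundaries are floored (integer division), so the
    # resulting sizes always add up exactly to `total`.
    cum = list(accumulate(weights))
    bounds = [0] + [total * c // cum[-1] for c in cum]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# tf_flowers has 3670 examples in its TRAIN split.
print(expected_sizes(3670, [10, 15, 75]))  # [367, 550, 2753]
```

With this scheme the middle split gets the floor of 15% (550) and the last split absorbs the remainder (2753), which is one of the two roundings mentioned above.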
I would also expect to be able to find out the subsplit sizes without iterating through the datasets.
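For the full (non-subsplit) splits, the size is already exposed through DatasetInfo when loading with with_info=True; something equivalent for subsplits would solve this. A sketch (requires downloading the dataset, tested against the tfds-nightly version above):

```python
import tensorflow_datasets as tfds

# DatasetInfo reports the size of the full TRAIN split without iterating;
# an equivalent for subsplits would avoid the counting loop above.
dataset, info = tfds.load("tf_flowers", split=tfds.Split.TRAIN,
                          as_supervised=True, with_info=True)
print(info.splits["train"].num_examples)  # 3670 for tf_flowers
```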
Additional context
TFDS is cool.