Conversation
Code Review
This pull request implements deterministic downsampling for datasets to ensure reproducibility during training. The changes include a new entry in the CHANGELOG.md and updated logic in the dataset transformation module. Review feedback highlights a missing pull request reference in the changelog and suggests an optimization to return the dataset directly when the target size matches the original size, avoiding unnecessary indirection.
finbarrtimbers
left a comment
Please add a test, but otherwise, LGTM.
```python
# Create indices for upsampling
indices = []
elif target_size < original_size:
    return self.dataset.select(range(target_size))
```
Rather than this, can we just set a seed that doesn't change, shuffle, then pick?
We can run into issues where the dataset as uploaded is not shuffled, and so picking the first n samples ends up not really being a random sample. We've run into issues in the past where someone put in e.g. a 50% downsample and just missed half the sources in the dataset.
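The suggestion above could be sketched like this in plain Python (standing in for the Hugging Face `Dataset` API used in the PR; the helper name and default seed are illustrative, not from the source): shuffling indices with a fixed seed makes the pick random with respect to the dataset's upload order, yet identical across runs.

```python
import random

def deterministic_sample_indices(original_size: int, target_size: int, seed: int = 42) -> list[int]:
    """Pick target_size indices reproducibly.

    Shuffling first guards against unshuffled uploads (e.g. a 50% downsample
    that would otherwise miss half the sources), while the fixed seed keeps
    repeated runs deterministic.
    """
    indices = list(range(original_size))
    random.Random(seed).shuffle(indices)  # fixed seed => same order every run
    return indices[:target_size]

# Same seed, same call => same subsample on every run:
first = deterministic_sample_indices(100, 10)
second = deterministic_sample_indices(100, 10)
assert first == second
```

With the real `datasets` API this would presumably look like `dataset.shuffle(seed=...).select(range(target_size))`, but the stdlib version above shows the idea without any dependency.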
Hmm, I generally want it specifically not shuffled, but I can add this as an arg
yea, happy to have a `no_shuffle` arg or something similar!
Currently for `dataset_mixer_list dataset_name number_of_samples`, if the `number_of_samples` is less than the size of the dataset, we randomly downsample. This means repeated runs aren't deterministic if you're doing debugging on a small subset of the data, because you're sampling different data each time.

Switch to deterministic downsampling, but keep random upsampling, as we likely never do debugging with upsampling.