Fix: deterministic downsampling#1603

Open
mnoukhov wants to merge 4 commits into main from fix/deterministic-downsampling

Conversation

@mnoukhov
Contributor

Currently, for dataset_mixer_list entries of the form dataset_name number_of_samples, if number_of_samples is less than the size of the dataset, we randomly downsample. This means repeated runs aren't deterministic when you're debugging on a small subset of the data, because each run samples different examples.

Switch to deterministic downsampling, but keep random upsampling since we likely never debug with upsampling.
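A minimal sketch of the behavior described above, using plain Python lists in place of the actual datasets API; the function name, seed default, and signature are illustrative, not the PR's code:

```python
import random

def downsample(dataset, target_size, seed=42):
    """Deterministically take the first target_size examples when
    downsampling; keep upsampling random, as we rarely debug with it."""
    original_size = len(dataset)
    if target_size <= original_size:
        # Deterministic: the same subset on every run.
        return dataset[:target_size]
    # Upsampling: sample indices with replacement.
    rng = random.Random(seed)
    indices = [rng.randrange(original_size) for _ in range(target_size)]
    return [dataset[i] for i in indices]
```

With this, two debugging runs over the same dataset and sample count always see identical data.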

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request implements deterministic downsampling for datasets to ensure reproducibility during training. The changes include a new entry in the CHANGELOG.md and updated logic in the dataset transformation module. Review feedback highlights a missing pull request reference in the changelog and suggests an optimization to return the dataset directly when the target size matches the original size, avoiding unnecessary indirection.

Comment thread CHANGELOG.md
Comment thread open_instruct/dataset_transformation.py Outdated
Collaborator

@finbarrtimbers left a comment


Please add a test, but otherwise, LGTM.

    # Create indices for upsampling
    indices = []
elif target_size < original_size:
    return self.dataset.select(range(target_size))
Collaborator


Rather than this, can we just set a seed that doesn't change and shuffle, then pick?

We can run into issues where the dataset as uploaded is not shuffled, and so picking the first n samples ends up not really being a random sample. We've run into issues in the past where someone put in e.g. a 50% downsample and just missed half the sources in the dataset.
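The seeded shuffle-then-pick alternative can be sketched as follows; this is a hypothetical illustration with plain Python lists, and the function name and seed default are assumptions:

```python
import random

def seeded_downsample(dataset, target_size, seed=42):
    """Shuffle indices with a fixed seed, then take the first target_size.
    Deterministic across runs, but still a pseudo-random sample, so an
    unshuffled upload doesn't bias the subset toward its leading sources."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [dataset[i] for i in indices[:target_size]]
```

Unlike taking the first n rows directly, a 50% downsample here still draws from the whole dataset rather than only its first half.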

Contributor Author


Hmm, I generally want it specifically not shuffled, but I can add this as an arg

Collaborator


yea happy to have a no_shuffle arg or something similar!
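Combining both suggestions, a no_shuffle arg could look like this sketch (plain Python lists stand in for the datasets API; the parameter name follows the comment above, everything else is illustrative):

```python
import random

def downsample(dataset, target_size, seed=42, no_shuffle=False):
    """Pick target_size examples deterministically.
    no_shuffle=True takes the first target_size rows as-is (handy for
    debugging); otherwise shuffle with a fixed seed first so the subset
    is still a random draw from the whole dataset."""
    if no_shuffle:
        return dataset[:target_size]
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [dataset[i] for i in indices[:target_size]]
```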
