Fix: deterministic downsampling#1603

Open
mnoukhov wants to merge 4 commits into main from fix/deterministic-downsampling

Conversation

@mnoukhov
Contributor

Currently, for dataset_mixer_list entries of the form dataset_name number_of_samples, if number_of_samples is less than the size of the dataset, we randomly downsample. This means repeated runs aren't deterministic when you're debugging on a small subset of the data, because each run samples different examples.

Switch to deterministic downsampling, but keep random upsampling since we likely never debug with upsampling.
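A minimal sketch of the behavior described above, using plain Python lists in place of the actual datasets API; the function name, seed default, and signature are illustrative, not the PR's code:

```python
import random

def downsample(dataset, target_size, seed=42):
    """Deterministically take the first target_size examples when
    downsampling; keep upsampling random, as we rarely debug with it."""
    original_size = len(dataset)
    if target_size <= original_size:
        # Deterministic: the same subset on every run.
        return dataset[:target_size]
    # Upsampling: sample indices with replacement.
    rng = random.Random(seed)
    indices = [rng.randrange(original_size) for _ in range(target_size)]
    return [dataset[i] for i in indices]
```

With this, two debugging runs over the same dataset and sample count always see identical data.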

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request implements deterministic downsampling for datasets to ensure reproducibility during training. The changes include a new entry in the CHANGELOG.md and updated logic in the dataset transformation module. Review feedback highlights a missing pull request reference in the changelog and suggests an optimization to return the dataset directly when the target size matches the original size, avoiding unnecessary indirection.

Comment thread CHANGELOG.md
Comment thread open_instruct/dataset_transformation.py Outdated
Collaborator

@finbarrtimbers left a comment


Please add a test, but otherwise, LGTM.

    # Create indices for upsampling
    indices = []
elif target_size < original_size:
    return self.dataset.select(range(target_size))
Collaborator


Rather than this, can we just set a seed that doesn't change and shuffle, then pick?

We can run into issues where the dataset as uploaded is not shuffled, and so picking the first n samples ends up not really being a random sample. We've run into issues in the past where someone put in e.g. a 50% downsample and just missed half the sources in the dataset.
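The seeded shuffle-then-pick alternative can be sketched as follows; this is a hypothetical illustration with plain Python lists, and the function name and seed default are assumptions:

```python
import random

def seeded_downsample(dataset, target_size, seed=42):
    """Shuffle indices with a fixed seed, then take the first target_size.
    Deterministic across runs, but still a pseudo-random sample, so an
    unshuffled upload doesn't bias the subset toward its leading sources."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [dataset[i] for i in indices[:target_size]]
```

Unlike taking the first n rows directly, a 50% downsample here still draws from the whole dataset rather than only its first half.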

Contributor Author


Hmm, I generally want it specifically not shuffled, but I can add this as an arg

Collaborator


yea happy to have a no_shuffle arg or something similar!
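Combining both suggestions, a no_shuffle arg could look like this sketch (plain Python lists stand in for the datasets API; the parameter name follows the comment above, everything else is illustrative):

```python
import random

def downsample(dataset, target_size, seed=42, no_shuffle=False):
    """Pick target_size examples deterministically.
    no_shuffle=True takes the first target_size rows as-is (handy for
    debugging); otherwise shuffle with a fixed seed first so the subset
    is still a random draw from the whole dataset."""
    if no_shuffle:
        return dataset[:target_size]
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [dataset[i] for i in indices[:target_size]]
```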
