
Conversation

@mori360 mori360 (Contributor) commented Feb 12, 2025

Fix issue #809

The current `self._data.skip(self._sample_idx)` does not return the correct data for the c4 dataset.
Thus we switch to `next()` as a workaround until a proper fix lands.
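A minimal sketch of the workaround, where `resume_iter` and its arguments are illustrative names rather than the PR's actual code:

```python
# Illustrative sketch of the workaround: instead of dataset.skip(n), which
# can return the wrong samples for some streaming datasets, advance a
# fresh iterator by calling next() n times.
def resume_iter(dataset, sample_idx):
    it = iter(dataset)
    for _ in range(sample_idx):  # O(sample_idx): slower than skip()
        next(it)
    return it

it = resume_iter(range(10), 3)
print(next(it))  # -> 3
```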

Test plan:
We reproduce #809 by resuming from a checkpoint at step 500, then compare the loss curves under 3 conditions:

  1. the original curve, running from step 0 to 750
  2. the resumed curve, keeping `.skip()`
  3. the resumed curve, switched to `next()` with this PR's change
[Screenshot: loss curve comparison of the three conditions, 2025-02-12]

Warning
For the c4 dataset, if we resume from a large enough step, we call `next()` `self._sample_idx` times, so resuming from a checkpoint is much slower than using `.skip()`.

Next step:
Add unit tests:

  1. test that the state_dict matches between dcp.save/load and torch.save/load
  2. test the difference between `next()` and `.skip()`
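Test (2) could start from a stdlib-only sketch like the following, where `.skip()` is stood in for by `itertools.islice`. For a plain in-memory sequence the two resume strategies agree; the reported bug is that they disagree for the streaming c4 dataset:

```python
import itertools

data = list(range(10))
n = 4  # number of already-consumed samples

# resume via next(): consume n elements, then continue
via_next = iter(data)
for _ in range(n):
    next(via_next)

# resume via skip-style slicing
via_skip = itertools.islice(iter(data), n, None)

# for an in-memory sequence, both strategies yield the same tail
assert list(via_next) == list(via_skip) == data[n:]
```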

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 12, 2025
@mori360 mori360 marked this pull request as ready for review February 12, 2025 20:34
@mori360 mori360 requested review from fegin and tianyu-l February 12, 2025 20:34
if isinstance(self._data, Dataset) and self._sample_idx == len(self._data):
    # map-style dataset fully consumed: nothing left to iterate
    return iter([])

# skip() fast-forwards past already-consumed samples when resuming
return iter(self._data.skip(self._sample_idx))
Contributor

I think we need to understand whether `skip` causes errors for both map-style and iterable datasets, or only in the newly added IterableDataset case.
If it's the latter, we should just revert #521 rather than universally use `next()` for both, because that would make the healthy case slow too.
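One possible shape for that split, sketched with a hypothetical `is_map_style` flag rather than the PR's actual code: keep the fast indexed path for map-style datasets and only pay the `next()` cost for iterable ones.

```python
import itertools

def make_resume_iter(data, sample_idx, is_map_style):
    if is_map_style:
        # map-style datasets are indexable, so skipping ahead stays cheap
        return (data[i] for i in range(sample_idx, len(data)))
    # iterable datasets: must actually consume sample_idx elements
    return itertools.islice(iter(data), sample_idx, None)

print(list(make_resume_iter([10, 11, 12, 13], 2, True)))         # -> [12, 13]
print(list(make_resume_iter(iter([10, 11, 12, 13]), 2, False)))  # -> [12, 13]
```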

Contributor

I would suggest that we land this PR first: a slower checkpoint resume is better than a silent accuracy failure, and this is blocking several accuracy verifications. At a minimum, we should make the default C4 dataset work for now.

@tianyu-l tianyu-l linked an issue Feb 13, 2025 that may be closed by this pull request
@tianyu-l tianyu-l (Contributor) left a comment

Stamp to unblock, but we should follow up with more robust tests.

@mori360 mori360 merged commit 0b0931c into pytorch:main Feb 13, 2025
6 checks passed
@mariosasko mariosasko mentioned this pull request Apr 9, 2025
tianyu-l pushed a commit that referenced this pull request May 16, 2025
This PR makes resuming dataset iteration from a checkpoint fast again.

This performance regression comes from
#838. In that PR, `.skip` is
removed for both map-style and iterable-style datasets for correctness
reasons. However, `.skip` works as expected for map-style datasets, so
the change can be reverted for that case. On the other hand, for
iterable-style datasets, calling `.skip` after `split_dataset_by_node`
splits the number of elements to skip **across the ranks** (e.g. calling
`.skip(10)` after `split_dataset_by_node(<rank>, 2)` effectively skips 5
(`10 // 2 = 5`) elements on each rank), which isn't what we want/expect,
so removing `.skip` was justified there. Still, we can make the whole
thing much faster using the [`state_dict`
API](https://huggingface.co/docs/datasets/v3.5.0/en/stream#save-a-dataset-checkpoint-and-resume-iteration)
for iterable-style datasets, which avoids re-iterating past shards/files
when resuming.
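The resume pattern can be illustrated with a stdlib-only stand-in; `StatefulSource` below is a hypothetical class mimicking the `state_dict()`/`load_state_dict()` shape of the Hugging Face `datasets` API, not the real implementation:

```python
# Hypothetical stand-in for an iterable dataset that supports fast resume:
# save a small iteration state and restore it directly, instead of
# re-consuming sample_idx elements with next().
class StatefulSource:
    def __init__(self, data):
        self.data = list(data)
        self.pos = 0

    def __iter__(self):
        while self.pos < len(self.data):
            item = self.data[self.pos]
            self.pos += 1
            yield item

    def state_dict(self):
        return {"pos": self.pos}

    def load_state_dict(self, state):
        self.pos = state["pos"]

src = StatefulSource(range(6))
it = iter(src)
next(it); next(it)             # consume two samples
state = src.state_dict()       # checkpoint: {"pos": 2}

resumed = StatefulSource(range(6))
resumed.load_state_dict(state)   # O(1) resume, no re-iteration
print(next(iter(resumed)))       # -> 2
```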


Development

Successfully merging this pull request may close these issues.

Loss metrics dramatically change after resuming from checkpoint
