feat: Distributed checkpointing #99

Merged
terrykong merged 36 commits into main from ashors/dist-ckpt
Apr 7, 2025

@ashors1
Contributor

ashors1 commented Mar 31, 2025

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Closes #109
Closes #110

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: ashors1 <ashors@nvidia.com>
ashors1 changed the title from "Distributed checkpointing" to "feat: Distributed checkpointing" on Mar 31, 2025
Comment thread nemo_reinforcer/models/policy/hf_policy.py Outdated
Comment thread nemo_reinforcer/algorithms/sft.py
@terrykong
Collaborator

Does this close #34?

@ashors1
Contributor Author

ashors1 commented Mar 31, 2025

> Does this close #34?

Yes. There are a few loose ends I'd like to tie up before officially closing it, but this is the first step.

ashors1 added 2 commits March 31, 2025 15:34
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@terrykong
Collaborator

Does HF load from this format automagically if someone did AutoModel.from_pretrained()?

@ashors1
Contributor Author

ashors1 commented Apr 1, 2025

> Does HF load from this format automagically if someone did AutoModel.from_pretrained()?

Not right now. HF compatibility is one of the things I need to work on before closing #34.

Signed-off-by: ashors1 <ashors@nvidia.com>
Comment thread nemo_reinforcer/models/policy/hf_policy.py Outdated
Comment thread nemo_reinforcer/models/policy/hf_policy.py Outdated
Comment thread nemo_reinforcer/models/policy/hf_policy.py
Comment thread nemo_reinforcer/utils/checkpoint.py Outdated
Signed-off-by: ashors1 <ashors@nvidia.com>
Comment thread nemo_reinforcer/models/policy/hf_policy.py Outdated
ashors1 added 6 commits April 1, 2025 11:49
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1
Contributor Author

ashors1 commented Apr 1, 2025

With the HF checkpoint support provided in this PR, we should also be able to close #110. Users can either save their checkpoints in HF format directly or convert a checkpoint from torch distributed format to HF format after training. Once the checkpoint is in HF format, it can be loaded via AutoModel.from_pretrained, e.g.:

uv run run_sft.py policy.model_name="/path/to/hf/checkpoint"

cc @terrykong
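
As a rough illustration of the loading path described above (a sketch, not code from this PR): before pointing `policy.model_name` at a saved or converted directory, one could check that it has the layout `save_pretrained` normally produces, then load it with `AutoModel.from_pretrained`. The helper names `looks_like_hf_checkpoint` and `load_hf_checkpoint` are hypothetical, and the check assumes a standard `config.json` plus one of the usual weight files; what this PR's converter actually writes may differ.

```python
from pathlib import Path

# Weight file names that transformers' save_pretrained() commonly writes
# (single-file and sharded variants). This list is an assumption about the
# checkpoint layout, not something taken from this PR.
_WEIGHT_FILES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
)


def looks_like_hf_checkpoint(checkpoint_dir):
    """Hypothetical helper: True if the directory has config.json and a weights file."""
    root = Path(checkpoint_dir)
    if not (root / "config.json").is_file():
        return False
    return any((root / name).is_file() for name in _WEIGHT_FILES)


def load_hf_checkpoint(checkpoint_dir):
    """Hypothetical helper: load the directory via AutoModel.from_pretrained."""
    if not looks_like_hf_checkpoint(checkpoint_dir):
        raise FileNotFoundError(
            f"{checkpoint_dir} does not look like an HF-format checkpoint"
        )
    # Imported lazily so the layout check above works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(checkpoint_dir)
```

A checkpoint still in torch distributed format would fail this check, which is the case where the conversion step mentioned above is needed first.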

Comment thread examples/convert_dcp_to_hf.py Outdated
Comment thread examples/convert_dcp_to_hf.py Outdated
Comment thread examples/convert_dcp_to_hf.py
Comment thread nemo_reinforcer/utils/hf_checkpoint.py Outdated
@terrykong
Collaborator

@ashors1 Also, could you add that ckpt converter script to our docs?

Signed-off-by: ashors1 <ashors@nvidia.com>
ashors1 added 2 commits April 3, 2025 13:06
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Apr 4, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Apr 4, 2025
@ashors1 ashors1 added Run CICD and removed Run CICD labels Apr 5, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Comment thread nemo_reinforcer/utils/hf_checkpoint.py
Comment thread tests/functional/sft.sh Outdated
Comment thread nemo_reinforcer/algorithms/grpo.py Outdated
Comment thread nemo_reinforcer/algorithms/sft.py Outdated
Comment thread nemo_reinforcer/models/policy/hf_policy.py Outdated
ashors1 added 3 commits April 7, 2025 11:25
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Apr 7, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Apr 7, 2025
@terrykong terrykong merged commit 5622163 into main Apr 7, 2025
11 checks passed
@terrykong terrykong deleted the ashors/dist-ckpt branch April 7, 2025 22:14
parthchadha pushed a commit that referenced this pull request Apr 11, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
KiddoZhu pushed a commit that referenced this pull request May 6, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>

Labels

Documentation (improvements or additions to documentation)

Development

Successfully merging this pull request may close these issues.

  • Support AutoModel.from_pretrained from reinforcer checkpoints
  • Improve HF checkpointing time with torch dist checkpointing

4 participants