S3 Dirpath + Async Uploading Support for Default Checkpoints #9045
Merged
ericharper merged 43 commits into NVIDIA-NeMo:main on Jun 15, 2024
mikolajblaz (Collaborator) reviewed on May 7, 2024 and left a comment:
I added some high-level comments.
How does this PR relate to NVIDIA/Megatron-LM#748?
    # If the future is complete, we can remove the temp file since we choose to clear the temp file when uploading.
    try:
        self._temp_files.remove(item[2])
    except:
Check notice (Code scanning / CodeQL): Except block handles 'BaseException'
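The CodeQL notice above is about the bare `except:`, which also swallows `KeyboardInterrupt` and `SystemExit`. A minimal sketch of the same bookkeeping with a narrow exception type is shown below. The class name `AsyncUploader`, the `upload_fn` callback, and the tuple layout `(future, destination, temp_path)` are assumptions made for illustration and are not taken from the PR's `S3CheckpointIO`.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple


class AsyncUploader:
    """Hypothetical helper: uploads temp files in background threads and cleans them up."""

    def __init__(self, upload_fn: Callable[[str, str], None], max_workers: int = 2):
        self._upload_fn = upload_fn  # assumed signature: (local_path, destination)
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        # Each entry is (future, destination, temp_file_path).
        self._pending: List[Tuple[Future, str, str]] = []
        self._temp_files: List[str] = []

    def submit(self, temp_path: str, destination: str) -> None:
        self._temp_files.append(temp_path)
        future = self._executor.submit(self._upload_fn, temp_path, destination)
        self._pending.append((future, destination, temp_path))

    def reap_finished(self) -> None:
        """Forget finished uploads, delete their temp files, and re-raise the first upload error."""
        still_running, errors = [], []
        for item in self._pending:
            future, _, temp_path = item
            if not future.done():
                still_running.append(item)
                continue
            try:
                self._temp_files.remove(temp_path)
            except ValueError:
                # list.remove raises ValueError when the entry is already gone;
                # catching only that keeps KeyboardInterrupt and SystemExit intact.
                pass
            if os.path.exists(temp_path):
                os.remove(temp_path)
            exc = future.exception()
            if exc is not None:
                errors.append(exc)
        self._pending = still_running
        if errors:
            raise errors[0]

    def close(self) -> None:
        # Wait for all in-flight uploads, then clean up whatever finished.
        self._executor.shutdown(wait=True)
        self.reap_finished()
```

Catching `ValueError` is enough here because that is the only exception `list.remove` raises for a missing element; anything else is left to propagate.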
    try:
        import awscrt
        import s3transfer.crt
Check notice (Code scanning / CodeQL): Unused import
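Importing a module only to learn whether it is installed is what triggers the "Unused import" notice. One way to probe for optional packages without binding unused names is `importlib.util.find_spec`, sketched below. The function names `module_available` and `crt_available` are hypothetical; according to the commit history, the PR itself later removed the explicit CRT calls because boto[crt] uses CRT automatically.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError when that parent is not installed.
        return False


def crt_available() -> bool:
    """Hypothetical check for the two optional CRT-related modules named in the diff."""
    return module_available("awscrt") and module_available("s3transfer.crt")


if __name__ == "__main__":
    print("CRT transfer support detected:", crt_available())
```

The result can be used to choose a code path or to emit a clear error message, while the real import is deferred to the place where the module is actually used.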
Author (Contributor) reply:
The linked PR adds S3 checkpointing support for the distributed checkpoint format.
JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request on Jun 18, 2024:
…NeMo#9045)
* Add S3 dirpath and asynchronous uploading support for basic checkpointing
* Update megtron_gpt_pretraining config to support S3 checkpointing
* Removed unused imports
* move s3_checkpoint_io into callbacks. consolidate checkpoint_file_utils into s3_utils.py
* Update setup() in nemo_model_checkpoint to broadcast checkpoint path and work with upstreamed implementation of removing unfinished checkpoints
* Add boto3 dependency for testing
* Remove redundant setup() in nemo_model_checkpoint
* Remove comment line from import
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Removed explicit CRT calls since boto[crt] automatically uses CRT for file upload and download
* Style fix
* remove un-used s3transfer import
* add s3 prefix for s3-related checkpointing config
* dummy sleep function lowered from 1 to 0.01 seconds
* Remove local_rank checking for rank, and use is_global_rank_zero.
* Style fix
* Apply isort and black reformatting
* add tenacity dependency
* Apply isort and black reformatting
* Add filtering of unfinished checkpoint to non-s3 checkpoint resuming
* isort black reformatting
* Apply isort and black reformatting
* Remove dependency requirement for checking if dirpath is an s3 path
* Make dependencies fully optional; allow exp_manager to optionally import S3Utils depending on whether dirpath is an S3 address or not
* Add rst doc for s3 checkpointing
* Remove unneeded assert
* Removed dependencies
* Apply isort and black reformatting
* Updated documentation on async save to S3
* Apply isort and black reformatting
* Update S3 checkpointing doc and fix visibility on website. Update the nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.
* Apply isort and black reformatting
* Slight fix in s3 checkpoint doc
---------
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <166076199+alxzhang-amazon@users.noreply.github.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024.
XuesongYang pushed a commit to paarthneekhara/NeMo that referenced this pull request on Jan 18, 2025.
What does this PR do?
This PR creates a new CheckpointIO which allows users to upload checkpoints directly to S3.
Collection: [NLP]
Changelog
* S3CheckpointIO to enable the strategy to upload default checkpoints to S3.
* checkpoint_file_utils file, which identifies existing paths for S3CheckpointIO to clean up existing checkpoints.
* S3Utils class that contains helper methods for interacting with S3.
* exp_manager check_resume to only run on rank 0 when using an S3 dirpath (prevents S3 throttling caused by check_resume operations).
* NeMoCheckpointConnector to exp_manager to call resume_start using the broadcasted checkpoint path.
* setup() override in NeMoModelCheckpoint to broadcast trainer.ckpt_path, since only rank 0 has it after check_resume.
* requirements_nlp.txt to include support for the S3 filesystem as well as CRT support for faster uploading.
Usage
Example from the megatron_gpt_pretraining.py example
Config Updates:
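The changelog and commit history describe treating an S3 dirpath differently from a local one, and checking for an S3 path without requiring extra dependencies. A minimal sketch of such a check follows, using only the standard library. The helper names `is_s3_url` and `parse_s3_url` are hypothetical and are not claimed to match the PR's `S3Utils` API.

```python
from typing import Optional, Tuple

S3_PREFIX = "s3://"


def is_s3_url(path: Optional[str]) -> bool:
    """Return True if `path` looks like an S3 address (hypothetical helper)."""
    return isinstance(path, str) and path.startswith(S3_PREFIX)


def parse_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not is_s3_url(url):
        raise ValueError(f"not an S3 URL: {url!r}")
    bucket, _, key = url[len(S3_PREFIX):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {url!r}")
    return bucket, key


if __name__ == "__main__":
    for dirpath in ("s3://my-bucket/checkpoints/run1", "/tmp/checkpoints"):
        kind = "S3" if is_s3_url(dirpath) else "local"
        print(f"{dirpath} -> {kind}")
```

Because the check is a plain string test, code such as an experiment manager could call it before deciding whether to import any S3-specific module at all.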
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information