Open
Update checkpoint log pattern matching to support NeMo 2's 'Global Checkpoint Save' message format, which provides epoch timestamps and save duration directly in the log message.

Changes:
- log_patterns.py: Update the CHECKPOINT_WRITE_START regex to match NeMo 2's 'Global Checkpoint Save : Rank: X : Iteration: N : Start time: EPOCHs : Save duration: Xs' format
- calculate_checkpoint_metrics.py: Extract the start time as an epoch float from the message instead of parsing the log prefix timestamp; map the end message step (N+1) to the start iteration (N)
- calculate_checkpoint_metrics_test.py: Add 14 unit tests covering pattern matching, timestamp parsing, checkpoint extraction, and write duration computation
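As a rough illustration of the message format described above, the following sketch parses a 'Global Checkpoint Save' line into typed fields. It is not the regex from log_patterns.py: the pattern name, the helper functions, the tolerance for extra whitespace, and the assumption that both numeric fields carry a trailing `s` are all guesses drawn from the message template in this description.

```python
import re

# Hypothetical pattern written from the message template in the PR description;
# the real CHECKPOINT_WRITE_START regex in log_patterns.py may differ.
GLOBAL_CHECKPOINT_SAVE_SKETCH = re.compile(
    r"Global Checkpoint Save\s*:\s*Rank\s*:\s*(?P<rank>\d+)\s*:\s*"
    r"Iteration\s*:\s*(?P<iteration>\d+)\s*:\s*"
    r"Start time\s*:\s*(?P<start>\d+(?:\.\d+)?)s\s*:\s*"
    r"Save duration\s*:\s*(?P<duration>\d+(?:\.\d+)?)s"
)


def parse_global_checkpoint_save(line):
    """Return the fields of a 'Global Checkpoint Save' line, or None if absent."""
    match = GLOBAL_CHECKPOINT_SAVE_SKETCH.search(line)
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "iteration": int(match.group("iteration")),
        # The start time is taken from the message itself as an epoch float,
        # not from the timestamp in the log line prefix.
        "start_time": float(match.group("start")),
        "save_duration": float(match.group("duration")),
    }


def start_iteration_for_end_step(end_step):
    """Map the step in an end message (N+1) back to the start iteration (N)."""
    return end_step - 1


if __name__ == "__main__":
    # Synthetic line built from the template; not taken from a real NeMo 2 run.
    sample = (
        "Global Checkpoint Save : Rank: 0 : Iteration: 100 : "
        "Start time: 1700000000.5s : Save duration: 2.25s"
    )
    print(parse_global_checkpoint_save(sample))
    print(start_iteration_for_end_step(101))
```

Using `search` rather than `match` lets the pattern ignore whatever prefix the logger adds before the message, which fits the stated goal of no longer depending on the prefix timestamp.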
- Specify NeMo 2 compatibility
- Add sample output from a real NeMo 2 run
- Add testing section with unittest command
- Fix stray backtick in pip install command
mkmg reviewed Feb 23, 2026
Add a plugin architecture for checkpointing metrics to support multiple log formats (NeMo 1, NeMo 2, and future frameworks) with auto-detection.

- Add log_parser.py: abstract LogParser base class + registry
- Add nemo1_parser.py: NeMo 1 checkpoint log parser
- Add nemo2_parser.py: NeMo 2 checkpoint log parser
- Update calculate_checkpoint_metrics.py: use parser infrastructure, add --log_format CLI arg (default: auto)
- Update log_patterns.py: move framework-specific patterns to parsers
- Update tests: 29 tests covering both formats + auto-detection
- Update README.md: document multi-format support
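The parser registry and auto-detection described above could be structured roughly as follows. This is a minimal sketch, not the contents of log_parser.py: only the names LogParser and register_parser appear in this PR, while the `name` and `marker` attributes, `can_parse`, `get_parser`, and both example parser classes are hypothetical. The "demo" format and its marker are invented purely to show a second registered format.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a format name to its parser class.
_PARSERS = {}


def register_parser(cls):
    """Class decorator that adds a parser class to the registry."""
    _PARSERS[cls.name] = cls
    return cls


class LogParser(ABC):
    """Base class for format-specific checkpoint log parsers (sketch)."""

    name = ""    # value accepted by --log_format (assumed convention)
    marker = ""  # substring that identifies this log format (assumed)

    @classmethod
    def can_parse(cls, lines):
        # Auto-detection here is a plain substring check on each line.
        return any(cls.marker in line for line in lines)

    @abstractmethod
    def parse(self, lines):
        """Return the lines relevant to checkpoint metrics."""


@register_parser
class Nemo2SketchParser(LogParser):
    name = "nemo2"
    marker = "Global Checkpoint Save"  # message prefix from the NeMo 2 change

    def parse(self, lines):
        return [line for line in lines if self.marker in line]


@register_parser
class DemoParser(LogParser):
    name = "demo"
    marker = "DEMO_CHECKPOINT"  # invented marker, for illustration only

    def parse(self, lines):
        return [line for line in lines if self.marker in line]


def get_parser(lines, log_format="auto"):
    """Pick a parser by explicit name, or by auto-detection when 'auto'."""
    if log_format != "auto":
        if log_format not in _PARSERS:
            raise ValueError(f"unknown log format: {log_format!r}")
        return _PARSERS[log_format]()
    for cls in _PARSERS.values():
        if cls.can_parse(lines):
            return cls()
    raise ValueError("could not auto-detect the log format")
```

With this shape, supporting another framework means adding one decorated subclass, and the default `auto` mode picks it up without changes to the calling script.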
mkmg approved these changes Feb 28, 2026
docs: update checkpointing metrics README for NeMo 2
How to add a new framework
Create <framework>_parser.py, subclass LogParser, and decorate the subclass with @register_parser.
Tested: