Checkpoint nemo2 migration #120

Open
Marlon666 wants to merge 3 commits into AI-Hypercomputer:main from Marlon666:checkpoint-nemo2-migration

Marlon666 commented Feb 19, 2026

Update checkpoint log pattern matching to support NeMo 2's
'Global Checkpoint Save' message format, which provides epoch
timestamps and save duration directly in the log message.

Changes:

  • log_patterns.py: Update CHECKPOINT_WRITE_START regex to match
    NeMo 2's 'Global Checkpoint Save : Rank: X : Iteration: N :
    Start time: EPOCHs : Save duration: Xs' format
  • calculate_checkpoint_metrics.py: Extract start time as epoch
    float from the message instead of parsing log prefix timestamp;
    map end message step (N+1) to start iteration (N)
  • calculate_checkpoint_metrics_test.py: Add 14 unit tests covering
    pattern matching, timestamp parsing, checkpoint extraction, and
    write duration computation
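
For reviewers, here is a minimal sketch of how a line in the 'Global Checkpoint Save' format could be matched and its fields extracted. The pattern name `NEMO2_CHECKPOINT_SAVE`, the helper names, and the exact whitespace handling are assumptions for illustration; the actual `CHECKPOINT_WRITE_START` regex in `log_patterns.py` may differ.

```python
import re

# Hypothetical pattern, not the real CHECKPOINT_WRITE_START. It follows the
# message layout quoted above and tolerates variable spacing around colons.
NEMO2_CHECKPOINT_SAVE = re.compile(
    r"Global Checkpoint Save\s*:\s*Rank:\s*(?P<rank>\d+)\s*:\s*"
    r"Iteration:\s*(?P<iteration>\d+)\s*:\s*"
    r"Start time:\s*(?P<start>\d+(?:\.\d+)?)s\s*:\s*"
    r"Save duration:\s*(?P<duration>\d+(?:\.\d+)?)s"
)


def parse_checkpoint_save(line):
    """Return the fields of a save message as a dict, or None if no match."""
    match = NEMO2_CHECKPOINT_SAVE.search(line)
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "iteration": int(match["iteration"]),
        # The start time is read as an epoch float from the message itself,
        # not from the log prefix timestamp.
        "start_time": float(match["start"]),
        "save_duration": float(match["duration"]),
    }


def start_iteration_for_end_step(end_step):
    """Map an end message step (N+1) back to its start iteration (N)."""
    return end_step - 1
```

The sample values in any call to these helpers would be made up; only the message layout comes from the description above.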

docs: update checkpointing metrics README for NeMo 2

  • Specify NeMo 2 compatibility
  • Add sample output from a real NeMo 2 run
  • Add testing section with unittest command
  • Fix stray backtick in pip install command

How to add a new framework

  1. Create <framework>_parser.py, subclass LogParser
  2. Decorate with @register_parser
  3. Import in calculate_checkpoint_metrics.py
  4. Add tests — no core logic changes needed

Tested:

> python calculate_checkpoint_metrics.py \
    --gcs_logs_path gs://tess-benchmark-outputs/muzi-8b-dl-ckpt-20260217-175559
Analyzing file: muzi-8b-dl-ckpt-20260217-175559/nemo_log_globalrank-1_localrank-1.txt, Global rank: 1, Local rank: 1
Auto-detected log format: nemo2
Analyzing file: muzi-8b-dl-ckpt-20260217-175559/run_0/nemo_log_globalrank-2_localrank-2.txt, Global rank: 2, Local rank: 2
Auto-detected log format: nemo2
Analyzing file: muzi-8b-dl-ckpt-20260217-175559/nemo_log_globalrank-0_localrank-0.txt, Global rank: 0, Local rank: 0
Auto-detected log format: nemo2
Analyzing file: muzi-8b-dl-ckpt-20260217-175559/nemo_log_globalrank-3_localrank-3.txt, Global rank: 3, Local rank: 3
Auto-detected log format: nemo2
min checkpoint write duration: 4.9700000286102295s
max checkpoint write duration: 34.86500000953674s
average checkpoint write duration: 17.880999982357025s
checkpoint write time standard deviation: 12.443055009810607
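
As a rough illustration of how the four summary lines could be derived from a list of per-checkpoint write durations, here is a sketch using the standard library. The function name `summarize_durations` and the sample durations are made up, and the use of the population standard deviation is an assumption; the script may use the sample standard deviation instead.

```python
import statistics


def summarize_durations(durations):
    """Compute min, max, mean, and standard deviation of write durations."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.fmean(durations),
        # Assumption: population standard deviation. statistics.stdev would
        # give the sample standard deviation instead.
        "stdev": statistics.pstdev(durations),
    }


# Illustrative durations in seconds, not taken from the run above.
summary = summarize_durations([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(f"min checkpoint write duration: {summary['min']}s")
print(f"max checkpoint write duration: {summary['max']}s")
print(f"average checkpoint write duration: {summary['mean']}s")
print(f"checkpoint write time standard deviation: {summary['stdev']}")
```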

Add a plugin architecture for checkpointing metrics to support multiple
log formats (NeMo 1, NeMo 2, and future frameworks) with auto-detection.

- Add log_parser.py: abstract LogParser base class + registry
- Add nemo1_parser.py: NeMo 1 checkpoint log parser
- Add nemo2_parser.py: NeMo 2 checkpoint log parser
- Update calculate_checkpoint_metrics.py: use parser infrastructure,
  add --log_format CLI arg (default: auto)
- Update log_patterns.py: move framework-specific patterns to parsers
- Update tests: 29 tests covering both formats + auto-detection
- Update README.md: document multi-format support
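
A minimal sketch of how `--log_format` with a default of `auto` could interact with auto-detection. The function names, the marker table, and the restriction of the flag to a fixed set of choices are assumptions. The NeMo 2 marker is the message prefix quoted in this PR; the NeMo 1 entry is a placeholder because its log wording is not given here.

```python
import argparse

# Hypothetical marker table: format name -> substring that identifies it.
FORMAT_MARKERS = {
    "nemo2": "Global Checkpoint Save",    # prefix quoted in this PR
    "nemo1": "PLACEHOLDER_NEMO1_MARKER",  # placeholder, not a real NeMo 1 message
}


def detect_format(lines, markers=FORMAT_MARKERS):
    """Return the first format whose marker appears in the lines, else None."""
    for line in lines:
        for name, marker in markers.items():
            if marker in line:
                return name
    return None


def build_arg_parser():
    parser = argparse.ArgumentParser(description="Checkpoint metrics (sketch)")
    parser.add_argument("--gcs_logs_path", required=True)
    parser.add_argument(
        "--log_format",
        default="auto",
        choices=["auto", *FORMAT_MARKERS],
        help="Log format to parse; 'auto' inspects the log content.",
    )
    return parser


def resolve_format(args, lines):
    """Honor an explicit --log_format, otherwise fall back to detection."""
    if args.log_format != "auto":
        return args.log_format
    detected = detect_format(lines)
    if detected is None:
        raise ValueError("could not auto-detect log format")
    return detected
```

An explicit `--log_format nemo1` would bypass detection entirely in this sketch, which is one way to handle logs that contain no recognizable marker.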

Marlon666 requested a review from mkmg on February 24, 2026 20:14