
Refit regression for MoE models using Megatron-Bridge #1044

@yfw

Description

Refit is slower with Megatron-Bridge than with nemo.tron for MoE models. To reproduce, use these branches:

- baseline: https://github.com/NVIDIA-NeMo/RL/tree/yifu/before_mbridge
- mbridge: https://github.com/NVIDIA-NeMo/RL/tree/yifu/mbridge_dsv3

DSv2-Lite:

```shell
uv run python examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml \
    grpo.val_batch_size=2 \
    policy.model_name=deepseek-ai/DeepSeek-V2-Lite-Chat \
    cluster.gpus_per_node=8 \
    policy.megatron_cfg.pipeline_model_parallel_size=4 \
    policy.megatron_cfg.num_layers_in_first_pipeline_stage=7 \
    policy.megatron_cfg.num_layers_in_last_pipeline_stage=6 \
    policy.max_total_sequence_length=1024 \
    checkpointing.enabled=False \
    checkpointing.save_period=5 \
    grpo.val_period=-1 \
    grpo.val_at_start=False \
    grpo.max_val_samples=16 \
    policy.megatron_cfg.expert_model_parallel_size=2 \
    policy.megatron_cfg.apply_rope_fusion=False
```

Make sure to set a different `NRL_MEGATRON_CHECKPOINT_DIR` when testing with mbridge so that it does not reuse the nemo.tron checkpoints.
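For example, the two runs can be pointed at separate checkpoint directories before launching the command above (the paths below are hypothetical; substitute your own):

```shell
# Baseline (nemo.tron) run -- hypothetical path:
export NRL_MEGATRON_CHECKPOINT_DIR=/tmp/ckpts_nemotron
# ... run the repro command on the baseline branch ...

# Megatron-Bridge run -- use a separate directory so the checkpoints
# converted during the baseline run are not silently reused:
export NRL_MEGATRON_CHECKPOINT_DIR=/tmp/ckpts_mbridge
echo "$NRL_MEGATRON_CHECKPOINT_DIR"
```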


Labels: bug (Something isn't working), deepseek (Related to deepseek 671b)
