
Refit regression for MoE models using Megatron-Bridge #1044

@yfw

Description

Refit is slower with Megatron-Bridge than with nemo.tron for MoE models. To reproduce, use these branches:

- baseline: https://github.com/NVIDIA-NeMo/RL/tree/yifu/before_mbridge
- mbridge: https://github.com/NVIDIA-NeMo/RL/tree/yifu/mbridge_dsv3

DSv2-Lite:

```shell
uv run python examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml \
    grpo.val_batch_size=2 \
    policy.model_name=deepseek-ai/DeepSeek-V2-Lite-Chat \
    cluster.gpus_per_node=8 \
    policy.megatron_cfg.pipeline_model_parallel_size=4 \
    policy.megatron_cfg.num_layers_in_first_pipeline_stage=7 \
    policy.megatron_cfg.num_layers_in_last_pipeline_stage=6 \
    policy.max_total_sequence_length=1024 \
    checkpointing.enabled=False \
    checkpointing.save_period=5 \
    grpo.val_period=-1 \
    grpo.val_at_start=False \
    grpo.max_val_samples=16 \
    policy.megatron_cfg.expert_model_parallel_size=2 \
    policy.megatron_cfg.apply_rope_fusion=False
```

Make sure to set a different `NRL_MEGATRON_CHECKPOINT_DIR` when testing with mbridge so that it does not reuse the nemo.tron checkpoints.
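For example, the two runs can be pointed at separate checkpoint directories before launching the command above (the paths below are hypothetical; substitute your own):

```shell
# Baseline (nemo.tron) run -- hypothetical path:
export NRL_MEGATRON_CHECKPOINT_DIR=/tmp/ckpts_nemotron
# ... run the repro command on the baseline branch ...

# Megatron-Bridge run -- use a separate directory so the checkpoints
# converted during the baseline run are not silently reused:
export NRL_MEGATRON_CHECKPOINT_DIR=/tmp/ckpts_mbridge
echo "$NRL_MEGATRON_CHECKPOINT_DIR"
```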


Labels: bug (Something isn't working), deepseek (Related to deepseek 671b)
