[mcore] qwen2moe support #1139
Conversation
There seems to be a bug with computing the gsm8k value during multi-node training.
Could you fix the code format? Also, we added a qwen moe weight loader patcher yesterday.
Can you share more details about your qwen moe weight loader patcher? I will check and merge the two implementations.
Here: #1137
ETOgaosion left a comment:
Great Job!
Note that there seem to be multiple weight converters currently; we may need to unify them into a single functional unit.
Besides, the transformer config needs some refinement.
Also, the current dist checkpoint and converter may need some CI tests.
This PR is already large enough, so let's merge it first.
About the converter and the transformer config, what direction should the refinement take? About CI: CI tests are necessary, let's add some in the coming days.
```python
    ):
        return init_mcore_model_dense(
            tfconfig, hf_config, pre_process, post_process, share_embeddings_and_output_weights, value

from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
```
As we use megatron.core in all the init functions, can we move this import to the top of the file to avoid duplicate imports?
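A minimal sketch of the suggested change, assuming the init functions live in one module (the helper name and signature below are illustrative assumptions, not the PR's actual code):

```python
# The megatron.core import is hoisted to module level so every init function
# shares it instead of re-importing inside each function body.
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec


def build_qwen2moe_layer_spec(tfconfig, use_te: bool = True):
    # Hypothetical helper: with the import at the top of the file, the function
    # body only builds the decoder block spec.
    return get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
```

One trade-off, echoed in the reply below: a module-level import pulls in megatron.core (and possibly Transformer Engine) as soon as the file is imported, which is why the CPU initialization path needs to be checked first.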
Thanks! We will try this if it does not affect the CPU initialization process.
Good suggestion, we may refine the code in follow-up PRs.
```python
    return transformer_layer_spec

    assert tfconfig.normalization == "RMSNorm", "only RMSNorm is supported for now"
    transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
```
nit: can we directly set use_transformer_engine to True instead of threading a separate use_te variable through?
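A one-line sketch of the nit, assuming `tfconfig` is already constructed as in the diff above:

```python
# Pass the flag inline instead of carrying a separate use_te variable.
transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=True)
```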
Thanks a lot. We may take this into consideration.
I have optimized the previous code in the new PR #1200.
## Motivation

This is a fix for the issue where the `weight_loader` in FusedMoe of the vLLM code could not be used correctly during the resharding phase, addressed in #923, #1137, and #1139 respectively. Currently, the results of these PRs can be used together, allowing both FSDP and Megatron to use the same function and reducing code maintenance costs.
Support the qwen2moe structure to run with megatron-core, including:

* qwen2moe config converter (a hedged sketch of such a converter follows this list)
* qwen2moe model initializer
* refactor of the online weight converter from mcore to vllm
* qwen2moe online weight converter
* qwen2moe offline weight conversion script from hf to mcore
* a script to run training of qwen1.5moe_a2.7b with 4 nodes

TODO: add an option to freeze the MoE router weight during training (see the sketch after the screenshot below).
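For context on the first item, here is a hedged sketch of what a qwen2moe config converter could look like; the field names on the megatron.core side are assumptions based on `TransformerConfig` and may differ from the converter actually added in this PR:

```python
import torch.nn.functional as F
from megatron.core.transformer.transformer_config import TransformerConfig


def hf_qwen2moe_to_mcore_config(hf_config) -> TransformerConfig:
    """Illustrative mapping from a HuggingFace Qwen2MoeConfig to a mcore TransformerConfig."""
    return TransformerConfig(
        num_layers=hf_config.num_hidden_layers,
        hidden_size=hf_config.hidden_size,
        num_attention_heads=hf_config.num_attention_heads,
        num_query_groups=hf_config.num_key_value_heads,  # grouped-query attention
        ffn_hidden_size=hf_config.intermediate_size,
        num_moe_experts=hf_config.num_experts,
        moe_router_topk=hf_config.num_experts_per_tok,
        normalization="RMSNorm",
        layernorm_epsilon=hf_config.rms_norm_eps,
        activation_func=F.silu,
        gated_linear_unit=True,   # SwiGLU-style MLP
        add_bias_linear=False,    # Qwen2 has no bias on the MLP linears
    )
```

Per-expert and shared-expert intermediate sizes are omitted here because their field names vary across megatron.core versions; the real converter also has to handle parallelism and dtype settings.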
