[mcore] qwen2moe support #1139
Conversation
There seems to be a bug with computing the gsm8k value during multi-node training.
Could you fix the code format? Also, we added a qwen moe weight loader patcher yesterday.
Can you share more details about your qwen moe weight loader patcher? I will check and merge the two implementations.
Here: #1137
ETOgaosion left a comment:
Great Job!
Note that there seem to be multiple weight converters currently; we may need to unify them into a single functional unit.
Besides, the transformer config needs some refinement.
Also, the current dist checkpoint and converter may need some CI tests.
This PR is already large enough, so let's merge it first.
About the converter and the transformer config, what direction should the refinement take? About CI: CI tests are necessary, let's add some in the coming days.
```python
    ):
        return init_mcore_model_dense(
            tfconfig, hf_config, pre_process, post_process, share_embeddings_and_output_weights, value

from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
```
As we use megatron.core in all the init functions, can we move this import to the top of the file to avoid duplicate imports?
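A minimal sketch of the suggested change, assuming the init functions live in one module (the helper name and signature below are illustrative assumptions, not the PR's actual code):

```python
# The megatron.core import is hoisted to module level so every init function
# shares it instead of re-importing inside each function body.
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec


def build_qwen2moe_layer_spec(tfconfig, use_te: bool = True):
    # Hypothetical helper: with the import at the top of the file, the function
    # body only builds the decoder block spec.
    return get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
```

One trade-off, echoed in the reply below: a module-level import pulls in megatron.core (and possibly Transformer Engine) as soon as the file is imported, which is why the CPU initialization path needs to be checked first.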
Thanks! We will try this if it does not affect the CPU initialization process.
Good suggestion, we may refine the code in follow-up PRs.
```python
    return transformer_layer_spec

    assert tfconfig.normalization == "RMSNorm", "only RMSNorm is supported for now"
    transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
```
nit: can we directly set use_transformer_engine to True instead of threading a separate use_te variable through?
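A one-line sketch of the nit, assuming `tfconfig` is already constructed as in the diff above:

```python
# Pass the flag inline instead of carrying a separate use_te variable.
transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=True)
```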
Thanks a lot. We may take this into consideration.
I have optimized the previous code in the new PR #1200.
## Motivation

This is a fix for the issue where the `weight_loader` in FusedMoe of the vLLM code could not be used correctly during the resharding phase, addressed in #923, #1137, and #1139 respectively. Currently, the results of these PRs can be used together, allowing both FSDP and Megatron to use the same function and reducing code maintenance costs.
Support the qwen2moe structure to run with megatron-core, including:

* qwen2moe config converter (a hedged sketch of such a converter follows this list)
* qwen2moe model initializer
* refactor of the online weight converter from mcore to vllm
* qwen2moe online weight converter
* qwen2moe offline weight conversion script from hf to mcore
* a script to run training of qwen1.5moe_a2.7b with 4 nodes

TODO: add an option to freeze the MoE router weight during training (see the sketch after the screenshot below).
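For context on the first item, here is a hedged sketch of what a qwen2moe config converter could look like; the field names on the megatron.core side are assumptions based on `TransformerConfig` and may differ from the converter actually added in this PR:

```python
import torch.nn.functional as F
from megatron.core.transformer.transformer_config import TransformerConfig


def hf_qwen2moe_to_mcore_config(hf_config) -> TransformerConfig:
    """Illustrative mapping from a HuggingFace Qwen2MoeConfig to a mcore TransformerConfig."""
    return TransformerConfig(
        num_layers=hf_config.num_hidden_layers,
        hidden_size=hf_config.hidden_size,
        num_attention_heads=hf_config.num_attention_heads,
        num_query_groups=hf_config.num_key_value_heads,  # grouped-query attention
        ffn_hidden_size=hf_config.intermediate_size,
        num_moe_experts=hf_config.num_experts,
        moe_router_topk=hf_config.num_experts_per_tok,
        normalization="RMSNorm",
        layernorm_epsilon=hf_config.rms_norm_eps,
        activation_func=F.silu,
        gated_linear_unit=True,   # SwiGLU-style MLP
        add_bias_linear=False,    # Qwen2 has no bias on the MLP linears
    )
```

Per-expert and shared-expert intermediate sizes are omitted here because their field names vary across megatron.core versions; the real converter also has to handle parallelism and dtype settings.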
