[Mcore] context parallel#970

Merged
ETOgaosion merged 6 commits into verl-project:main from ISEEKYAN:mcore_context_parallel
Apr 10, 2025

Collaborator

@ISEEKYAN ISEEKYAN commented Apr 8, 2025

Support context parallel for the mcore backend.
Changes to:

  • configs
  • model loader
  • checkpoint
  • single control dispatcher
  • forward preprocess and postprocess
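
For readers unfamiliar with what "forward preprocess and postprocess" involves under context parallel, here is a minimal sketch. It is not the code from this PR. It assumes the load-balanced split commonly used for causal attention, where a sequence is cut into 2 * cp_size chunks and rank r takes chunk r plus its mirror chunk. The function names `split_for_context_parallel` and `gather_from_context_parallel` are hypothetical.

```python
def split_for_context_parallel(tokens, cp_size, cp_rank):
    """Return the part of `tokens` that one CP rank would process.

    Assumption (not taken from the PR): the sequence is cut into
    2 * cp_size equal chunks and rank r takes chunks r and
    (2 * cp_size - 1 - r), which balances causal-attention work.
    """
    num_chunks = 2 * cp_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("pad the sequence to a multiple of 2 * cp_size first")
    chunk = len(tokens) // num_chunks
    mirror = num_chunks - 1 - cp_rank
    front = tokens[cp_rank * chunk:(cp_rank + 1) * chunk]
    back = tokens[mirror * chunk:(mirror + 1) * chunk]
    return front + back


def gather_from_context_parallel(shards, cp_size):
    """Inverse of the split: rebuild the full sequence from per-rank shards."""
    num_chunks = 2 * cp_size
    chunk = len(shards[0]) // 2
    chunks = [None] * num_chunks
    for rank, shard in enumerate(shards):
        chunks[rank] = shard[:chunk]
        chunks[num_chunks - 1 - rank] = shard[chunk:]
    return [tok for piece in chunks for tok in piece]


tokens = list(range(8))
shards = [split_for_context_parallel(tokens, 2, r) for r in range(2)]
restored = gather_from_context_parallel(shards, 2)
```

In this toy run each of the two ranks holds half of the sequence, and the postprocess step restores the original order.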

@ISEEKYAN ISEEKYAN marked this pull request as ready for review April 8, 2025 06:35
Collaborator

ccclyu commented Apr 8, 2025

Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

Collaborator Author

ISEEKYAN commented Apr 8, 2025

> Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

I tried it on 1 node with 8 H100 GPUs, comparing tp4dp2cp1 with tp4dp1cp2. In this test cp2 (gray line) is slower than cp1. That result is reasonable: the run is not memory-limited, so cp2 only means less data parallelism and more communication. CP should be useful when the sequence length is larger, but so far I have focused on getting the functionality implemented and haven't had time for further performance testing.

Screenshot 2025-04-08 15 09 42
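
The two layouts compared above follow from simple arithmetic on the 8 GPUs. The sketch below is only an illustration: it assumes pipeline parallel size 1 (the comment does not mention pp), and the helper names `data_parallel_size` and `tokens_per_gpu` are hypothetical.

```python
def data_parallel_size(world_size, tp, cp, pp=1):
    """Number of data-parallel replicas left after tp, cp and pp are chosen.

    Assumes world_size = tp * cp * pp * dp, with pp defaulting to 1.
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * cp * pp")
    return world_size // model_parallel


def tokens_per_gpu(seq_len, cp):
    """Tokens of one sequence held by each GPU in a context-parallel group."""
    return seq_len // cp


dp_without_cp = data_parallel_size(world_size=8, tp=4, cp=1)  # tp4dp2cp1
dp_with_cp = data_parallel_size(world_size=8, tp=4, cp=2)     # tp4dp1cp2
```

With cp=2 each GPU holds half of every sequence, but only one data-parallel replica remains. That matches the explanation above: when memory is not the bottleneck, the extra communication and lost data parallelism make cp2 slower.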

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

@ccclyu please try it out and provide feedback~

  dp_rank = mpu.get_data_parallel_rank()
  #TODO: support ep
- return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}.pt")
+ return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt")
Collaborator Author

@ETOgaosion Because the optimizer states are distributed across all GPUs, the state for each dp rank also has to be saved separately.
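
To see why the extra rank suffixes matter, here is a small standalone sketch. It mirrors the file-name patterns from the diff above but takes the ranks as plain arguments instead of reading them from `mpu`. The function names and the example layout (tp4, cp2, dp1, pp1) are assumptions for illustration.

```python
import os


def old_optimizer_path(checkpoint_path, pp_rank, tp_rank):
    """File name pattern before the change: only pp and tp ranks."""
    return os.path.join(checkpoint_path, "optim",
                        f"distrib_optim_pp{pp_rank}_tp{tp_rank}.pt")


def new_optimizer_path(checkpoint_path, pp_rank, tp_rank, cp_rank, dp_rank):
    """File name pattern after the change: cp and dp ranks are included."""
    return os.path.join(
        checkpoint_path, "optim",
        f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt")


# Hypothetical 8-GPU layout: pp=1, tp=4, cp=2, dp=1.
ranks = [(0, tp, cp, 0) for tp in range(4) for cp in range(2)]
old_paths = {old_optimizer_path("ckpt", pp, tp) for pp, tp, cp, dp in ranks}
new_paths = {new_optimizer_path("ckpt", pp, tp, cp, dp)
             for pp, tp, cp, dp in ranks}
```

Under the old pattern the 8 ranks map to only 4 distinct files, so ranks that differ only in cp (or dp) would overwrite each other. The new pattern gives every rank its own file.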

Collaborator

Got it~ I will sync this to the docs later.

@ETOgaosion ETOgaosion merged commit 9f405b4 into verl-project:main Apr 10, 2025
21 of 22 checks passed
yanfeng98 pushed a commit to yanfeng98/fork-verl that referenced this pull request Apr 11, 2025
support context parallel for mcore backend.
Changes on:
* configs
* model loader
* checkpoint
* single control dispatcher
* forward preprocess and postprocess

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
yhyang201 pushed a commit to yhyang201/verl that referenced this pull request Apr 26, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026