Doc: add an environment variable to fix unbalanced memory capacity #1105
eric-haibin-lin merged 5 commits into verl-project:main from
Conversation
LGTM, but could you include more explanation in the document? As this issue can occur subtly in verl, we should ensure developers using verl can clearly understand the purpose of the environment variable from the documentation when troubleshooting.
ok, done
In the preview, it seems that
ok, done
If we use SGLang as the rollout engine, we should export `SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK` to avoid errors when memory capacity is unbalanced across devices; please refer to sgl-project/sglang#5426.
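A minimal sketch of setting the variable from Python before the engine starts (the value `true` is an assumption here; SGLang treats the variable as a boolean flag, and exporting it from the shell before launch works the same way):

```python
import os

# Must be set before SGLang / VerlEngine is created in each worker process.
# "true" is an assumed value; check SGLang's docs for the exact parsing.
os.environ["SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"] = "true"
```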
# Why should we export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK when using SGLang as the rollout engine in verl?
1. verl initializes an SGLangRollout module during rollout, which is used to evaluate/generate samples.
2. SGLangRollout initializes a VerlEngine, which in turn initializes a `torch.distributed.DeviceMesh` to support tensor parallelism (TP).
3. `DeviceMesh` initialization internally checks the free GPU memory of all participating devices; if the difference is too large (more than about 10%), it raises an error immediately to prevent initialization failures or communication deadlocks (an illustrative sketch of such a check follows this list).
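For intuition, here is an illustrative sketch of this kind of balance check; it is not SGLang's actual implementation, and the function name and 10% threshold are assumptions:

```python
import os
import torch
import torch.distributed as dist

def check_tp_memory_balance(tolerance: float = 0.10) -> None:
    """Raise if free GPU memory differs too much across ranks (illustrative)."""
    if os.environ.get("SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"):
        return  # the environment variable skips the check entirely
    free, _total = torch.cuda.mem_get_info()  # free bytes on this device
    all_free = [None] * dist.get_world_size()
    dist.all_gather_object(all_free, free)  # collect free memory of every rank
    if (max(all_free) - min(all_free)) / max(all_free) > tolerance:
        raise RuntimeError(f"Unbalanced free GPU memory across TP ranks: {all_free}")
```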
# Why might GPU memory be inconsistent across devices?
## Ray distributed actors load the model at different times

verl uses Ray for multi-process, multi-GPU concurrent training, and each `WorkerDict` may execute `self.rollout = SGLangRollout(...)` at a different time. Different workers therefore initialize the model at different times → different memory usage.
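A toy reproduction of this timing effect with Ray; the `RolloutWorker` actor is a hypothetical stand-in for verl's `WorkerDict`, not verl code:

```python
import ray
import torch

@ray.remote(num_gpus=1)
class RolloutWorker:
    # Hypothetical stand-in for a verl WorkerDict holding an SGLangRollout.
    def init_rollout(self) -> int:
        # Each actor reaches this point at a different wall-clock time,
        # so a snapshot of free GPU memory differs across workers.
        self.model = torch.nn.Linear(8192, 8192).cuda()
        free, _total = torch.cuda.mem_get_info()
        return free

ray.init()
workers = [RolloutWorker.remote() for _ in range(2)]
print(ray.get([w.init_rollout.remote() for w in workers]))  # per-worker free bytes
```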
## Delayed initialization causes memory skew

Some workers enter the model loading/inference path earlier than others, for example via `generate_sequences()` or `compute_log_prob()`. An early-loading worker's GPU memory has already been consumed by the model while a late-loading worker's GPU memory is still empty → the memory gap is large.
## verl + SGLang's TP initialization broadcasts across all devices, but there is no uniform release timing

SGLangRollout only needs to involve the GPUs used by the rollout workers, but its VerlEngine initialization calls `torch.distributed.init_process_group()` and broadcasts a set of weights. As a result:

- non-rollout GPUs also participate in the communication;
- the `DeviceMesh` is then initialized, and the "inconsistent memory" error is raised (a minimal sketch of this setup follows this list).
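For reference, a minimal sketch of this kind of process-group and device-mesh setup; the mesh shape and dimension name are illustrative, not verl's actual code:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Every rank in the process group joins, including GPUs that the rollout
# itself does not use, which is why their memory also enters the check.
dist.init_process_group(backend="nccl")
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),), mesh_dim_names=("tp",))
```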
## Different loading modes of FSDP/TP models also cause deviations

If the following parameters are set:
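```
actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True
```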
Some worker parameters stay on the CPU while others are sharded to the GPU in advance. This also creates an asymmetric distribution of GPU memory; a simplified sketch of what this offload means in FSDP terms is shown below.
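A simplified sketch of what `param_offload=True` corresponds to in PyTorch FSDP (verl wires this through its own config; this is not verl code):

```python
import torch
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# With offload_params=True the sharded parameters live in CPU memory, so
# this worker's GPU looks nearly empty next to one that shards to the GPU.
model = FSDP(
    torch.nn.Linear(8192, 8192),
    cpu_offload=CPUOffload(offload_params=True),
)
```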