
Doc: add an environment variable to fix unbalanced memory capacity #1105

Merged
eric-haibin-lin merged 5 commits into verl-project:main from minleminzui:doc
Apr 18, 2025
Conversation

@minleminzui (Contributor) commented Apr 16, 2025

If we use SGLang as the rollout engine, we should export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK to avoid errors when GPU memory capacity is unbalanced across devices; see sgl-project/sglang#5426.

Why should we export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK when using SGLang as the rollout engine in verl?

  1. verl initializes an SGLangRollout module during rollout, which is used to evaluate/generate samples.

  2. SGLangRollout initializes a VerlEngine, which in turn initializes a torch.distributed.DeviceMesh to support tensor parallelism (TP).

  3. DeviceMesh initialization internally checks the free GPU memory of all participating devices; if the difference is too large (more than about 10%), it raises an error immediately, to prevent initialization failures or communication deadlocks.
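Concretely, the check is bypassed by exporting the variable in the shell that launches training, before any worker process starts (an assumption based on the SGLang issue above: a truthy value such as "1" disables the check):

```shell
# Disable SGLang's TP free-memory balance check before launching verl.
# Assumption: "1" is accepted as a truthy value for this switch.
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1
```

Because the variable is read at engine initialization, it must be set in the environment of every worker process, i.e. before the Ray cluster and the trainer are started.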

Why might GPU memory be inconsistent across devices?

Ray distributed actors load the model at different times:

verl uses Ray for multi-process, multi-GPU concurrent training, and each WorkerDict may call
self.rollout = SGLangRollout(...)
at a different time. Because different workers initialize the model at different times, their memory usage differs.

Delayed initialization causes memory skew

Some workers enter the model loading/inference path earlier than others, for example via generate_sequences() or compute_log_prob(). On early-loading workers the model has already consumed GPU memory, while late-loading workers' GPUs are still mostly free → the memory gap is large.
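This failure mode can be illustrated with a minimal sketch of the balance check's logic. This is an assumption based on the description above, not the actual PyTorch/SGLang code: the real check lives inside device-mesh initialization, and the threshold and byte counts here are illustrative.

```python
def memory_balanced(free_bytes, tolerance=0.10):
    """Return True if every device's free memory is within `tolerance`
    (relative) of the device with the most free memory."""
    top = max(free_bytes)
    return all((top - free) / top <= tolerance for free in free_bytes)

# A worker that already loaded the model next to one that has not yet:
print(memory_balanced([30e9, 78e9]))  # False: (78-30)/78 ≈ 62% gap, init aborts
print(memory_balanced([76e9, 78e9]))  # True: gap under 10%, init proceeds
```

With staggered model loading, a snapshot of free memory taken during initialization can easily show a gap far above any reasonable tolerance, even though both workers would converge to similar usage once fully loaded.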

verl + SGLang's TP initialization broadcasts across all devices, but there is no uniform release timing

SGLangRollout only needs to involve the GPUs used by the rollout workers, but its VerlEngine initialization calls torch.distributed.init_process_group() and broadcasts a set of weights. As a result:

non-rollout GPUs also participate in the communication;

then, when the DeviceMesh is initialized, the "inconsistent memory" error is raised.

Different loading modes of FSDP/TP models also cause deviations

If the following parameters are set:

actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True

some workers' parameters are offloaded to the CPU while others' are already sharded onto the GPU, which also creates an asymmetric distribution of GPU memory.
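Putting the pieces together, a launch sketch might look as follows. The entrypoint name and the fully-qualified override paths are assumptions that may differ across verl versions (the config keys above are abbreviations); check your own run script.

```shell
# Hypothetical launch sketch. Set the variable in the same shell,
# before any Ray worker starts.
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.fsdp_config.param_offload=True
```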

@BearBiscuit05 (Collaborator)

LGTM, but could you include more explanations in the document? As this issue can occur subtly in verl, we should ensure developers using verl can clearly understand the purpose of environment variables from the documentation when troubleshooting.

@CLAassistant commented Apr 16, 2025

CLA assistant check
All committers have signed the CLA.

@minleminzui (Contributor, Author)

> LGTM, but could you include more explanations in the document? As this issue can occur subtly in verl, we should ensure developers using verl can clearly understand the purpose of environment variables from the documentation when troubleshooting.

ok, done

@BearBiscuit05 (Collaborator)

In the preview, it seems that actor.fsdp_config.param_offload=True, ref.fsdp_config.param_offload=True, and self.rollout = SGLangRollout(...) are not rendering correctly in the rst?

@minleminzui (Contributor, Author)

> In the preview, it seems that actor.fsdp_config.param_offload=True, ref.fsdp_config.param_offload=True, and self.rollout = SGLangRollout(...) are not rendering correctly in the rst?

ok, done

@eric-haibin-lin eric-haibin-lin merged commit c98fb31 into verl-project:main Apr 18, 2025
1 check passed
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
yhyang201 pushed a commit to yhyang201/verl that referenced this pull request Apr 26, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026