From ff4d29c4e184110efa76ca542a82efce0e9a6a81 Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 03:59:05 +0000
Subject: [PATCH 1/5] Doc: add an environment variable to fix unbalanced GPU
 memory capacity

---
 docs/workers/sglang_worker.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 13e0066503f..b649da4eec2 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -37,6 +37,7 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

 .. code-block:: bash

+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \

From 85c670ee848f724cf78576579574c50dbf56d28c Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 04:07:48 +0000
Subject: [PATCH 2/5] more

Co-authored-by: ocss884
---
 docs/workers/sglang_worker.rst | 46 +++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index b649da4eec2..4e7eb6ad1c5 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -34,10 +34,54 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

    python3 examples/data_preprocess/gsm8k.py

 2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
+SGLang + Verl
+^^^^^^^^^^^^^
+
+1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
+
+2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
+
+3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
+
+Why might there be inconsistent GPU memory?
+"""""""""""""""""""""""""""""""""""""""""""
+
+**1. Ray distributed actors load the model at different times**
+
+``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
+
+::
+
+self.rollout = SGLangRollout(...)
+
+Different workers initialize the model at different times → different memory usage.
+
+**2. Delayed initialization causes memory bias**
+
+Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
+Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
+
+**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
+
+Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
+
+- Non-rollout GPUs also join the communication.
+- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
+
+**4. Different FSDP/TP loading behaviors also lead to mismatch**
+
+If using:
+
+::
+
+actor.fsdp_config.param_offload=True
+ref.fsdp_config.param_offload=True
+
+Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.

 .. code-block:: bash

-   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK
+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True \
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \

From a768db4f282e2e90878ecd1066654ad321d5a66b Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 13:07:36 +0000
Subject: [PATCH 3/5] more

---
 docs/workers/sglang_worker.rst | 89 +++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 44 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 4e7eb6ad1c5..daa2cc6c230 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -34,50 +34,6 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

    python3 examples/data_preprocess/gsm8k.py

 2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
-SGLang + Verl
-^^^^^^^^^^^^^
-
-1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
-
-2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
-
-3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
-
-Why might there be inconsistent GPU memory?
-"""""""""""""""""""""""""""""""""""""""""""
-
-**1. Ray distributed actors load the model at different times**
-
-``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
-
-::
-
-self.rollout = SGLangRollout(...)
-
-Different workers initialize the model at different times → different memory usage.
-
-**2.
 Delayed initialization causes memory bias**
-
-Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
-Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
-
-**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
-
-Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
-
-- Non-rollout GPUs also join the communication.
-- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
-
-**4. Different FSDP/TP loading behaviors also lead to mismatch**
-
-If using:
-
-::
-
-actor.fsdp_config.param_offload=True
-ref.fsdp_config.param_offload=True
-
-Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.

 .. code-block:: bash

@@ -115,6 +71,51 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

+Why export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
+
+2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
+
+3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
+
+Why might there be inconsistent GPU memory?
+"""""""""""""""""""""""""""""""""""""""""""
+
+**1. Ray distributed actors load the model at different times**
+
+``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
+
+::
+
+self.rollout = SGLangRollout(...)
+
+Different workers initialize the model at different times → different memory usage.
+
+**2. Delayed initialization causes memory bias**
+
+Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
+Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
+
+**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
+
+Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
+
+- Non-rollout GPUs also join the communication.
+- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
+
+**4. Different FSDP/TP loading behaviors also lead to mismatch**
+
+If using:
+
+::
+
+actor.fsdp_config.param_offload=True
+ref.fsdp_config.param_offload=True
+
+Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.
+
 Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
 -------------------------------------------------------------------------------
 SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
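The ~10% free-memory imbalance check that this patch series works around (the one disabled by ``SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK``) can be sketched as follows. This is a simplified illustration of the behavior the documentation describes, not SGLang's actual implementation; the function name and the exact threshold are assumptions taken from the text above.

```python
import os

def check_memory_imbalance(free_mem_per_rank, threshold=0.1):
    """Raise if free GPU memory differs too much across TP ranks,
    unless the check is disabled via the environment variable."""
    if os.environ.get("SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"):
        return  # user opted out, e.g. because verl workers warm up at different times
    lo, hi = min(free_mem_per_rank), max(free_mem_per_rank)
    # Relative spread of free memory across ranks, compared to the threshold.
    if (hi - lo) / hi > threshold:
        raise RuntimeError(
            f"Free GPU memory is unbalanced across ranks: min={lo}, max={hi}"
        )

# Balanced ranks pass the check; a straggler that has already loaded
# model weights (and thus has much less free memory) would fail it.
check_memory_imbalance([80, 79, 81, 80])
```

With the environment variable exported, the check becomes a no-op, which is why uneven warm-up times across Ray workers no longer abort ``DeviceMesh`` initialization.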
From a3cc342f57c586cfa29154c341c500cf3a74e84c Mon Sep 17 00:00:00 2001
From: mlmz <54172054+minleminzui@users.noreply.github.com>
Date: Thu, 17 Apr 2025 10:45:32 +0800
Subject: [PATCH 4/5] Update sglang_worker.rst

---
 docs/workers/sglang_worker.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index daa2cc6c230..15726b810af 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -37,7 +37,7 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

 .. code-block:: bash

-   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True \
+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
@@ -87,7 +87,7 @@ Why might there be inconsistent GPU memory?

 ``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:

-::
+.. code-block:: python

 self.rollout = SGLangRollout(...)

@@ -109,7 +109,7 @@ Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` i

 If using:

-::
+.. code-block:: bash

 actor.fsdp_config.param_offload=True
 ref.fsdp_config.param_offload=True

From fed7d7bfa76ea599dba084bf88630d63dcdd964b Mon Sep 17 00:00:00 2001
From: mlmz <54172054+minleminzui@users.noreply.github.com>
Date: Thu, 17 Apr 2025 10:46:58 +0800
Subject: [PATCH 5/5] Update sglang_worker.rst

---
 docs/workers/sglang_worker.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 15726b810af..e42b2004358 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -89,7 +89,7 @@ Why might there be inconsistent GPU memory?

 .. code-block:: python

-self.rollout = SGLangRollout(...)
+   self.rollout = SGLangRollout(...)

 Different workers initialize the model at different times → different memory usage.

@@ -111,8 +111,8 @@ If using:

 .. code-block:: bash

-actor.fsdp_config.param_offload=True
-ref.fsdp_config.param_offload=True
+   actor.fsdp_config.param_offload=True
+   ref.fsdp_config.param_offload=True

 Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.
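If the trainer is launched from a Python driver rather than the shell script above, the same workaround can be applied in code before any distributed initialization. A minimal sketch; the only verl/SGLang-specific fact used is the variable name from the patches above:

```python
import os

# Must be set before torch.distributed / the SGLang engine initialize,
# so that the DeviceMesh free-memory imbalance check is skipped.
# setdefault keeps any value the user already exported in the shell.
os.environ.setdefault("SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK", "True")
```

Note that with Ray, environment variables set in the driver are not always visible to worker processes; they may need to be propagated explicitly, for example via Ray's ``runtime_env`` ``env_vars`` mechanism.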