From ff4d29c4e184110efa76ca542a82efce0e9a6a81 Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 03:59:05 +0000
Subject: [PATCH 1/5] Doc: add an environment variable to fix unbalanced GPU
 memory capacity

---
 docs/workers/sglang_worker.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 13e0066503f..b649da4eec2 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -37,6 +37,7 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

 .. code-block:: bash

+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \

From 85c670ee848f724cf78576579574c50dbf56d28c Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 04:07:48 +0000
Subject: [PATCH 2/5] more

Co-authored-by: ocss884
---
 docs/workers/sglang_worker.rst | 46 +++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index b649da4eec2..4e7eb6ad1c5 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -34,10 +34,54 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

    python3 examples/data_preprocess/gsm8k.py

 2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
+SGLang + Verl
+^^^^^^^^^^^^^
+
+1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
+
+2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
+
+3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
+
+Why might there be inconsistent GPU memory?
+"""""""""""""""""""""""""""""""""""""""""""
+
+**1. Ray distributed actors load the model at different times**
+
+``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
+
+::
+
+self.rollout = SGLangRollout(...)
+
+Different workers initialize the model at different times → different memory usage.
+
+**2. Delayed initialization causes memory bias**
+
+Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
+Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
+
+**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
+
+Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
+
+- Non-rollout GPUs also join the communication.
+- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
+
+**4. Different FSDP/TP loading behaviors also lead to mismatch**
+
+If using:
+
+::
+
+actor.fsdp_config.param_offload=True
+ref.fsdp_config.param_offload=True
+
+Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.

 .. code-block:: bash

-   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK
+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True \
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \

From a768db4f282e2e90878ecd1066654ad321d5a66b Mon Sep 17 00:00:00 2001
From: minleminzui <2969413251@qq.com>
Date: Wed, 16 Apr 2025 13:07:36 +0000
Subject: [PATCH 3/5] more

---
 docs/workers/sglang_worker.rst | 89 +++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 44 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 4e7eb6ad1c5..daa2cc6c230 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -34,50 +34,6 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

    python3 examples/data_preprocess/gsm8k.py

 2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
-SGLang + Verl
-^^^^^^^^^^^^^
-
-1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
-
-2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
-
-3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
-
-Why might there be inconsistent GPU memory?
-"""""""""""""""""""""""""""""""""""""""""""
-
-**1. Ray distributed actors load the model at different times**
-
-``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
-
-::
-
-self.rollout = SGLangRollout(...)
-
-Different workers initialize the model at different times → different memory usage.
-
-**2.
 Delayed initialization causes memory bias**
-
-Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
-Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
-
-**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
-
-Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
-
-- Non-rollout GPUs also join the communication.
-- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
-
-**4. Different FSDP/TP loading behaviors also lead to mismatch**
-
-If using:
-
-::
-
-actor.fsdp_config.param_offload=True
-ref.fsdp_config.param_offload=True
-
-Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.

 .. code-block:: bash

@@ -115,6 +71,51 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

+Why export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
+
+2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn initializes a ``torch.distributed.DeviceMesh`` used to support tensor parallelism (TP).
+
+3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it reports an error directly to avoid initialization failures or deadlocks.
+
+Why might there be inconsistent GPU memory?
+"""""""""""""""""""""""""""""""""""""""""""
+
+**1. Ray distributed actors load the model at different times**
+
+``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
+
+::
+
+self.rollout = SGLangRollout(...)
+
+Different workers initialize the model at different times → different memory usage.
+
+**2. Delayed initialization causes memory bias**
+
+Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
+Early workers have already used GPU memory → late workers still have empty memory → a memory difference appears.
+
+**3. SGLang's TP init uses an "all-device broadcast", but there is no uniform release timing**
+
+Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
+
+- Non-rollout GPUs also join the communication.
+- Later on, the ``DeviceMesh`` init will fail due to "inconsistent memory".
+
+**4. Different FSDP/TP loading behaviors also lead to mismatch**
+
+If using:
+
+::
+
+actor.fsdp_config.param_offload=True
+ref.fsdp_config.param_offload=True
+
+Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.
+
 Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
 -------------------------------------------------------------------------------
 SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
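The ~10% free-memory imbalance check that this patch series works around (the one disabled by ``SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK``) can be sketched as follows. This is a simplified illustration of the behavior the documentation describes, not SGLang's actual implementation; the function name and the exact threshold are assumptions taken from the text above.

```python
import os

def check_memory_imbalance(free_mem_per_rank, threshold=0.1):
    """Raise if free GPU memory differs too much across TP ranks,
    unless the check is disabled via the environment variable."""
    if os.environ.get("SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"):
        return  # user opted out, e.g. because verl workers warm up at different times
    lo, hi = min(free_mem_per_rank), max(free_mem_per_rank)
    # Relative spread of free memory across ranks, compared to the threshold.
    if (hi - lo) / hi > threshold:
        raise RuntimeError(
            f"Free GPU memory is unbalanced across ranks: min={lo}, max={hi}"
        )

# Balanced ranks pass the check; a straggler that has already loaded
# model weights (and thus has much less free memory) would fail it.
check_memory_imbalance([80, 79, 81, 80])
```

With the environment variable exported, the check becomes a no-op, which is why uneven warm-up times across Ray workers no longer abort ``DeviceMesh`` initialization.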
From a3cc342f57c586cfa29154c341c500cf3a74e84c Mon Sep 17 00:00:00 2001
From: mlmz <54172054+minleminzui@users.noreply.github.com>
Date: Thu, 17 Apr 2025 10:45:32 +0800
Subject: [PATCH 4/5] Update sglang_worker.rst

---
 docs/workers/sglang_worker.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index daa2cc6c230..15726b810af 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -37,7 +37,7 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

 .. code-block:: bash

-   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True \
+   export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
@@ -87,7 +87,7 @@ Why might there be inconsistent GPU memory?

 ``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:

-::
+.. code-block:: python

 self.rollout = SGLangRollout(...)

@@ -109,7 +109,7 @@ Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` i

 If using:

-::
+.. code-block:: bash

 actor.fsdp_config.param_offload=True
 ref.fsdp_config.param_offload=True

From fed7d7bfa76ea599dba084bf88630d63dcdd964b Mon Sep 17 00:00:00 2001
From: mlmz <54172054+minleminzui@users.noreply.github.com>
Date: Thu, 17 Apr 2025 10:46:58 +0800
Subject: [PATCH 5/5] Update sglang_worker.rst

---
 docs/workers/sglang_worker.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/workers/sglang_worker.rst b/docs/workers/sglang_worker.rst
index 15726b810af..e42b2004358 100644
--- a/docs/workers/sglang_worker.rst
+++ b/docs/workers/sglang_worker.rst
@@ -89,7 +89,7 @@ Why might there be inconsistent GPU memory?

 .. code-block:: python

-self.rollout = SGLangRollout(...)
+   self.rollout = SGLangRollout(...)

 Different workers initialize the model at different times → different memory usage.

@@ -111,8 +111,8 @@ If using:

 .. code-block:: bash

-actor.fsdp_config.param_offload=True
-ref.fsdp_config.param_offload=True
+   actor.fsdp_config.param_offload=True
+   ref.fsdp_config.param_offload=True

 Then some workers keep params on the CPU while others have already sharded them to the GPU → an asymmetric memory layout.
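If the trainer is launched from a Python driver rather than the shell script above, the same workaround can be applied in code before any distributed initialization. A minimal sketch; the only verl/SGLang-specific fact used is the variable name from the patches above:

```python
import os

# Must be set before torch.distributed / the SGLang engine initialize,
# so that the DeviceMesh free-memory imbalance check is skipped.
# setdefault keeps any value the user already exported in the shell.
os.environ.setdefault("SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK", "True")
```

Note that with Ray, environment variables set in the driver are not always visible to worker processes; they may need to be propagated explicitly, for example via Ray's ``runtime_env`` ``env_vars`` mechanism.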