
[rollout] feat: global request-level load balancer single source routing #5399

Merged
wuxibin89 merged 1 commit into verl-project:main from aoshen524:feat/global-load-balancer on Mar 9, 2026


Conversation

Contributor

@aoshen524 commented Feb 25, 2026

Summary

This PR is Task 1 (Phase 1) of the routing roadmap (#5442):

  • Introduce a global GlobalRequestLoadBalancer as the runtime single source of truth for routing.
  • Keep request-level sticky routing (request_id -> server) plus least-loaded routing (see the sketch below).
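For orientation, here is a minimal sketch of that routing core, reconstructed from the code excerpts quoted in the review below (constructor signature, acquire/release protocol, least-loaded selection with round-robin tie-breaking). The cache-eviction policy and error types are assumptions, and details may differ from the merged implementation:

```python
import ray

DEFAULT_ROUTING_CACHE_SIZE = 10000  # assumed default, per the docstring's "Defaults to 10000"


@ray.remote
class GlobalRequestLoadBalancer:
    """Global sticky-session + in-flight load balancer shared by all AgentLoopWorkers."""

    def __init__(self, num_servers: int, max_cache_size: int = DEFAULT_ROUTING_CACHE_SIZE):
        assert num_servers > 0
        self._inflight_requests = [0] * num_servers
        self._request_to_server: dict[str, int] = {}  # sticky request_id -> server_idx
        self._max_cache_size = max_cache_size
        self._next_server_idx = 0  # round-robin pointer for tie-breaking

    def acquire_server(self, request_id: str) -> int:
        # Sticky routing: a request_id seen before goes back to the same server
        # (prefix cache affinity for multi-turn conversations).
        server_idx = self._request_to_server.get(request_id)
        if server_idx is None:
            server_idx = self._select_least_loaded_server_idx()
            if len(self._request_to_server) < self._max_cache_size:
                self._request_to_server[request_id] = server_idx
        self._inflight_requests[server_idx] += 1
        return server_idx

    def release_server(self, server_idx: int) -> None:
        if not 0 <= server_idx < len(self._inflight_requests):
            raise ValueError(f"Invalid server_idx for release: {server_idx}")
        if self._inflight_requests[server_idx] <= 0:
            raise RuntimeError(f"Release called with no inflight requests on server {server_idx}")
        self._inflight_requests[server_idx] -= 1

    def _select_least_loaded_server_idx(self) -> int:
        # Least-loaded selection with round-robin tie-breaking among equals.
        min_inflight = min(self._inflight_requests)
        num_servers = len(self._inflight_requests)
        for offset in range(num_servers):
            candidate_idx = (self._next_server_idx + offset) % num_servers
            if self._inflight_requests[candidate_idx] == min_inflight:
                self._next_server_idx = (candidate_idx + 1) % num_servers
                return candidate_idx
        raise RuntimeError("Failed to select a server for routing.")
```

Every AgentLoopWorker shares one handle to this actor, so acquire_server/release_server counts stay globally consistent across nodes.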

Code Scope

  • verl/experimental/agent_loop/agent_loop.py
    • global LB routing path (acquire/release)
  • verl/experimental/fully_async_policy/agent_loop/agent_loop.py
    • keep required wiring only
  • tests/experimental/agent_loop/test_basic_agent_loop.py
    • load balancer behavior tests aligned with current strategy

Experiment Configuration (Copied)

Model/Data:

  • Model: Qwen2.5-VL-32B
  • Data: osworld.json
  • Machine: 3 * H100 Nodes

Runtime config (from run):

```bash
MAX_ASSISTANT_TURNS=50
MAX_PROMPT_LENGTH=24488
RESPONSE_LENGTH=512
MAX_TOKENS=512
NNODES=3
N_GPU_PER_NODE=8
HARDWARE=3 x H100 nodes
SP_SIZE=1
MICRO_BATCH_SIZE_PER_GPU=1
PPO_MINI_BATCH_SIZE=6
TRAIN_BATCH_SIZE=24
ROLLOUT_N=16
STRATEGY=fsdp2
ENGINE=vllm
DUMP_VALIDATION_DATA=true
DUMP_ROLLOUT_DATA=false
DUMP_TASK_SCORES=false
DUMP_ROLLOUT_TIMING=true
FILTER_GROUPS=true
DETERMINISTIC=false
TRAJECTORY_SPLITTING=true
USE_FLEX_ATTENTION_HETERO=false
DUMMY_MODE=disabled
RAY_DASHBOARD_ADDRESS=http://127.0.0.1:8265
nsys_prof=false
ENABLE_PROMETHEUS_MONITORING=true
PROMETHEUS_PORT=9090
SCRIPT_DIR=/home/ubuntu/verl-grounding
PROMETHEUS_CONFIG_FILE=/home/ubuntu/verl-grounding/gui_scripts/vllm_monitoring/prometheus.yaml
PROMETHEUS_SERVED_MODEL_NAME=qwen2_5_vl_32b
EXPERIMENT_NAME=gui-agent-fsdp2-gb24-3nodes-20260223_145012
HF_MODEL_PATH=/mnt/data2/models_and_datasets/sft-32b-bigs1-1027-s2-0103-osworld
TRAIN_DATA_PATH=data/nongoogledrive_02_04_no_infeasible_and_chart_editing_train.json
TEST_DATA_PATH=data/nongoogledrive_02_04_no_infeasible_and_chart_editing_test.json
PROJECT_NAME=verl_grounding
SAVE_PATH=/mnt/data2/saves/verl_grounding/gui-agent-fsdp2-gb24-3nodes-20260223_145012
ALL_OFFLOAD=true
ACTOR_PARAM_OFFLOAD=true
ACTOR_GRAD_OFFLOAD=true
ACTOR_OPTIMIZER_OFFLOAD=true
REF_PARAM_OFFLOAD=true
PROMETHEUS_PARAMS='actor_rollout_ref.rollout.disable_log_stats=False actor_rollout_ref.rollout.prometheus.enable=True actor_rollout_ref.rollout.prometheus.port=9090 actor_rollout_ref.rollout.prometheus.file=/home/ubuntu/verl-grounding/gui_scripts/vllm_monitoring/prometheus.yaml actor_rollout_ref.rollout.prometheus.served_model_name=qwen2_5_vl_32b'
```

Load Distribution Comparison (Placeholder)

Before (baseline):
<img width="1121" height="474" alt="148a35c49293e9a79fe25d2877d2670f" src="https://github.com/user-attachments/assets/771d4e09-77df-4a7e-b071-fe7fc570c2a4" />

After (optimized):
<img width="1152" height="491" alt="d516d2756af159a4b2b86d9228e8ac38" src="https://github.com/user-attachments/assets/7b6a1f8d-18d9-494a-9243-42edea6386a4" />

Metrics (Before PR -> After PR)

| Metric | Baseline | Optimized | Delta |
|---|---:|---:|---:|
| `agent_loop/per_step/generate_sequences/mean` | 6.3596 s | 4.5312 s | -1.8285 s (-28.75%) |
| `agent_loop/slowest/total_time` | 899.3761 s | 724.6898 s | -174.6863 s (-19.42%) |

Contributor

@gemini-code-assist (bot) left a comment

Code Review

The pull request successfully replaces the heapq-based local load balancing with a centralized GlobalRequestLoadBalancer Ray actor, which is a significant improvement for multi-node, multi-worker setups. The introduction of try/finally blocks for acquire_server and release_server ensures proper resource management and prevents counter leaks, enhancing the robustness of the system. However, there are a couple of areas that need attention to ensure correctness and maintainability.
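To make that pattern concrete, here is a hedged sketch of the acquire -> try/finally -> release flow. The try/finally shape and the helper names mirror the diff excerpts later in this thread; the generation call itself is a stand-in:

```python
from typing import Any


async def generate_with_routing(manager: Any, request_id: str, **sampling_params):
    # Acquire a server index from the load balancer (global actor or local fallback);
    # the in-flight counter for that server is incremented as part of acquisition.
    server_idx, server = await manager._acquire_server_with_index(request_id)
    try:
        # Route the request to the chosen server; any exception still triggers release.
        return await server.generate(request_id=request_id, **sampling_params)
    finally:
        # Always decrement the in-flight counter, even on error or cancellation,
        # so the load balancer's counts never leak.
        await manager._release_server(server_idx)
```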

Comment on lines +159 to +160

```python
server_idx, server = self._choose_server_with_index(request_id)
self._inflight_requests[server_idx] += 1
```

Contributor

critical

In AsyncLLMServerManager, when _load_balancer is None (local load-balancing mode), the _inflight_requests list is shared mutable state within a single AsyncLLMServerManager instance. The AgentLoopWorker.generate_sequences method uses asyncio.gather to run multiple agent loops concurrently, and each loop accesses the same self.server_manager instance. This creates a race condition when _acquire_server_with_index and _release_server modify self._inflight_requests concurrently, which can lead to incorrect in-flight request counts and thus suboptimal or incorrect load balancing. To fix this, access to self._inflight_requests in local mode must be protected by an asyncio.Lock.

Suggested change:

```diff
-server_idx, server = self._choose_server_with_index(request_id)
-self._inflight_requests[server_idx] += 1
+async with self._local_lock:
+    server_idx, server = self._choose_server_with_index(request_id)
+    self._inflight_requests[server_idx] += 1
```

Comment on lines +174 to +175

```python
if self._inflight_requests[server_idx] > 0:
    self._inflight_requests[server_idx] -= 1
```

Contributor

critical

Similar to the _acquire_server_with_index method, the _release_server method in local load-balancing mode also modifies self._inflight_requests, which is shared mutable state. This operation needs to be protected by an asyncio.Lock to prevent race conditions and ensure the integrity of the in-flight request counts.

Suggested change:

```diff
-if self._inflight_requests[server_idx] > 0:
-    self._inflight_requests[server_idx] -= 1
+async with self._local_lock:
+    if self._inflight_requests[server_idx] > 0:
+        self._inflight_requests[server_idx] -= 1
```

Comment on lines +117 to +121

```python
self._load_balancer = load_balancer_handle

# In global load-balancer mode, server index must map to the same instance on every worker.
if self._load_balancer is None:
    random.shuffle(self.server_handles)
```

Contributor

high

To ensure that the _local_lock is always available when _load_balancer is None, it should be initialized unconditionally in the __init__ method. This prevents a potential AttributeError if _local_lock is accessed before _load_balancer is determined to be None.

Suggested change:

```diff
+self.server_handles = list(server_handles)
 self._load_balancer = load_balancer_handle
+self._local_lock = asyncio.Lock()  # Initialize lock unconditionally

 # In global load-balancer mode, server index must map to the same instance on every worker.
 if self._load_balancer is None:
     random.shuffle(self.server_handles)
```

```python
        if self._inflight_requests[candidate_idx] == min_inflight:
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx
    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

The RuntimeError at this line appears to be unreachable. If num_servers is positive (which is checked in __init__), the loop for offset in range(num_servers) will always find a server with the minimum in-flight requests. This makes the RuntimeError dead code. Consider replacing it with an assertion to indicate an impossible state, which is more appropriate for logic errors that should not occur.

Suggested change:

```diff
-raise RuntimeError("Failed to select a server for routing.")
+assert False, "Logic error: should always find a server if num_servers > 0"
```

```python
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx

    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

Similar to the GlobalRequestLoadBalancer's _select_least_loaded_server_idx method, this RuntimeError is likely unreachable. The preceding check if not self._inflight_requests: handles the empty case. If _inflight_requests is not empty, the loop is guaranteed to find a server. Replacing this with an assertion would clarify that this is an unexpected, impossible state.

Suggested change:

```diff
-raise RuntimeError("Failed to select a server for routing.")
+assert False, "Logic error: should always find a server if num_servers > 0"
```

aoshen524 force-pushed the feat/global-load-balancer branch from b3a9533 to 8794343 on February 25, 2026 at 09:36
aoshen524 changed the title from "feat(agent_loop): replace heapq load balancing with global in-flight request balancer" to "feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity" on Feb 25, 2026
aoshen524 force-pushed the feat/global-load-balancer branch 3 times, most recently from 5f0a98c to 9bbcd96, on February 25, 2026 at 11:29
aoshen524 changed the title from "feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity" to "feat(agent_loop): global request-level load balancer single source routing" on Feb 28, 2026
```python
class GlobalRequestLoadBalancer:
    """Global sticky-session + in-flight load balancer shared by all AgentLoopWorkers."""

    def __init__(self, num_servers: int, max_cache_size: int = DEFAULT_ROUTING_CACHE_SIZE):
```

Collaborator

Better to pass in server actor ids here, since we may support elastic scale-up/down of servers in the future.

Contributor Author

sure

Contributor Author

@aoshen524 commented Mar 2, 2026

Should we do A or B? I think A is better here since the IP is shown.

A: self.server_addresses

AgentLoopManager: ['10.1.201.2:35057', '10.1.201.2:45275', '10.1.201.2:38121', '10.1.201.2:46807']

B: the Ray attribute _actor_id.hex()

['17e5a28bb23117714906f67323000000', '811580cb4c81c7fcb617c74623000000', 'a3ce9c328fb401163cdbb7d023000000', 'bad2866fe8059f1346b6442e23000000']
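For context, a hedged illustration of how each identifier could be obtained. The address getter is hypothetical; _actor_id.hex() is the Ray attribute named in option B:

```python
import ray


def describe_servers(server_handles: list[ray.actor.ActorHandle]) -> None:
    # Option A: human-readable HTTP addresses (IP:port visible in logs).
    # `get_server_address` is a hypothetical remote getter on the server actor.
    addresses = ray.get([h.get_server_address.remote() for h in server_handles])
    # Option B: opaque per-actor hex ids, via Ray's internal _actor_id attribute.
    actor_ids = [h._actor_id.hex() for h in server_handles]
    print("A:", addresses)
    print("B:", actor_ids)
```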

@wuxibin89
Collaborator

/gemini review

wuxibin89 changed the title from "feat(agent_loop): global request-level load balancer single source routing" to "[rollout] feat: global request-level load balancer single source routing" on Mar 2, 2026
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global GlobalRequestLoadBalancer to centralize routing logic, which is a significant architectural improvement over the previous per-worker approach. The implementation correctly handles sticky sessions and least-loaded routing with round-robin tie-breaking, and the new unit tests are comprehensive. I have one high-severity suggestion to improve the reliability of releasing server resources, which will prevent potential state corruption in the load balancer and enhance long-term system stability.

Comment on lines +138 to +143

```python
async def _release_server(self, server_idx: Optional[int]) -> None:
    if server_idx is None:
        return
    # Fire-and-forget: release is just a counter decrement, no need to await.
    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
    self._load_balancer.release_server.remote(server_idx=server_idx)
```

Contributor

high

The current fire-and-forget approach for releasing a server can lead to state corruption in the load balancer if the release_server call is lost (e.g., because the worker process crashes before the message is sent). This would cause the in-flight request count to "leak", eventually making the load balancer believe servers are busier than they are, which could degrade or halt the system.

While the comment correctly identifies the risk of blocking in a finally clause, an unresponsive load balancer would likely cause subsequent acquire_server calls to block anyway. Prioritizing the consistency of the load balancer's state seems more crucial for long-term system stability.

I suggest awaiting the release_server call within a try...except block to handle potential Ray errors gracefully. This ensures the release is confirmed while preventing the worker from getting stuck indefinitely if the load balancer actor is down.

Suggested change:

```diff
 async def _release_server(self, server_idx: Optional[int]) -> None:
     if server_idx is None:
         return
-    # Fire-and-forget: release is just a counter decrement, no need to await.
-    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
-    self._load_balancer.release_server.remote(server_idx=server_idx)
+    try:
+        await self._load_balancer.release_server.remote(server_idx=server_idx)
+    except ray.exceptions.RayError as e:
+        logger.warning(f"Failed to release server {server_idx} from load balancer: {e}")
```

(Comment thread on verl/experimental/agent_loop/agent_loop.py)
@wuxibin89
Collaborator

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, GlobalRequestLoadBalancer, to manage routing for AgentLoopWorkers. It refactors the load balancing logic from AsyncLLMServerManager into this new Ray actor, aiming to provide a single source of truth for routing and improve load distribution. A security audit confirmed that no high or critical security vulnerabilities were identified, and the implementation adheres to secure coding principles. The AI model's code review comments have been reviewed and are included as they provide valuable feedback on testing practices and potential logical issues.

Comment on lines +499 to +500

```python
with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release"):
    ray.get(lb.release_server.remote(server_idx=9))
```

Contributor

high

The match argument in pytest.raises uses a very specific string. If the error message in GlobalRequestLoadBalancer.release_server changes slightly (e.g., adds punctuation or more context), this test will fail even if the underlying logic remains correct. It is generally safer to use a more general regex, or to check for the exception type and a key substring; better still, match on the specific exception type and then assert on the message content separately if more precision is needed.

Suggested change:

```diff
-with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release"):
+with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release") as excinfo:
     ray.get(lb.release_server.remote(server_idx=9))
+assert "Invalid server_idx for release" in str(excinfo.value)
```

Comment on lines +504 to +505

```python
with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests"):
    ray.get(lb.release_server.remote(server_idx=1))
```

Contributor

high

Similar to the previous comment, the match argument for pytest.raises is very specific. The error message in GlobalRequestLoadBalancer.release_server is "Release called with no inflight requests on server {server_idx}". A minor change to this message could cause the test to break. Consider using a more robust matching strategy.

Suggested change:

```diff
-with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests"):
+with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests") as excinfo:
     ray.get(lb.release_server.remote(server_idx=1))
+assert "no inflight requests" in str(excinfo.value)
```

```python
        if self._inflight_requests[candidate_idx] == min_inflight:
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx
    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

The RuntimeError("Failed to select a server for routing.") on line 78 seems potentially unreachable, or indicates a fundamental flaw if it can be reached. Given that server_actor_ids is checked for emptiness in __init__, num_servers will always be greater than zero. The loop for offset in range(num_servers) guarantees that candidate_idx will iterate through all possible server indices. Since min_inflight is calculated from self._inflight_requests, at least one server will always have _inflight_requests[candidate_idx] == min_inflight. This suggests the RuntimeError might never be hit; if it is, it points to an unexpected state that needs further investigation or a more specific error condition.

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request load balancer to manage server routing, which is a great step towards a more robust and scalable architecture. The implementation of GlobalRequestLoadBalancer with sticky sessions and least-loaded routing with round-robin tie-breaking is solid, and the accompanying tests are comprehensive.

My main concern, which I've flagged as critical, is that the load_balancer_handle is treated as optional in several __init__ methods. However, the new implementation completely relies on this handle, and providing None will lead to runtime errors as there's no fallback to the old load balancing logic. I've suggested making this handle a required parameter to prevent such failures.

Comment on lines +121 to +128

```python
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
):
    """Initialize the AsyncLLMServerManager.

    Args:
        config (DictConfig): whole config for main entrypoint.
        server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
        max_cache_size (int, optional): max cache size for request_id to server mapping. Defaults to 10000.
        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
```

Contributor

critical

The load_balancer_handle is marked as Optional, but if it is None, calls to _acquire_server_with_index and _release_server will fail with an AttributeError because self._load_balancer will be None. The old load balancing logic has been removed, so there is no fallback. The load_balancer_handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
 ):
     """Initialize the AsyncLLMServerManager.

     Args:
         config (DictConfig): whole config for main entrypoint.
         server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
-        max_cache_size (int, optional): max cache size for request_id to server mapping. Defaults to 10000.
-        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
+        load_balancer_handle (ray.actor.ActorHandle): shared global load balancer actor.
```

Comment on lines +416 to +424

```python
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
):
    """Initialize agent loop manager.

    Args:
        config (DictConfig): YAML config.
        server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
        reward_loop_worker_handles (List[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
    """
```

Contributor

critical

The load_balancer_handle is marked as Optional, but it is passed to AsyncLLMServerManager, which requires it. If load_balancer_handle is None, this will lead to a runtime AttributeError since the internal load balancing logic has been removed. The handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
 ):
     """Initialize agent loop manager.

     Args:
         config (DictConfig): YAML config.
         server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
         reward_loop_worker_handles (List[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
-        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
+        load_balancer_handle (ray.actor.ActorHandle): shared global load balancer actor.
     """
```

```python
    config: DictConfig,
    server_handles: list[ray.actor.ActorHandle],
    reward_loop_worker_handles: list[ray.actor.ActorHandle] = None,
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
```

Contributor

critical

The load_balancer_handle is marked as Optional, but it is passed to FullyAsyncLLMServerManager and the parent AgentLoopWorker, both of which require it. If it is None, this will cause a runtime AttributeError. The handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
```

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, GlobalRequestLoadBalancer, as a centralized Ray actor to manage routing for LLM servers. The implementation correctly handles sticky sessions for multi-turn conversations and falls back to a least-loaded strategy with round-robin tie-breaking for new requests. The new load balancer is integrated into AsyncLLMServerManager and wired through the AgentLoopManager and AgentLoopWorker components. Unit tests for the new load balancer are also included and cover various routing and session stickiness scenarios.

My main feedback concerns the implementation of _release_server in AsyncLLMServerManager. It is unnecessarily declared as an async method, which can be confusing. I've suggested changing it to a synchronous method for better code clarity, which also requires updating the call sites.

Comment on lines +140 to +145

```python
async def _release_server(self, server_idx: Optional[int]) -> None:
    if server_idx is None:
        return
    # Fire-and-forget: release is just a counter decrement, no need to await.
    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
    self._load_balancer.release_server.remote(server_idx=server_idx)
```

Contributor

high

This method is declared as async but contains no await expressions. This is misleading because it suggests an asynchronous operation that can be awaited, but it completes synchronously. To improve clarity and align with its fire-and-forget nature, it should be a regular synchronous method. This is particularly important as it is used in a finally block.

Suggested change:

```diff
-async def _release_server(self, server_idx: Optional[int]) -> None:
+def _release_server(self, server_idx: Optional[int]) -> None:
     if server_idx is None:
         return
     # Fire-and-forget: release is just a counter decrement, no need to await.
     # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
     self._load_balancer.release_server.remote(server_idx=server_idx)
```

```python
        )
        return output
    finally:
        await self._release_server(server_idx)
```

Contributor

high

Following the recommended change of _release_server to a synchronous method, the await keyword should be removed here to maintain consistency and correctness.

Suggested change:

```diff
-await self._release_server(server_idx)
+self._release_server(server_idx)
```

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, which is a significant and well-executed architectural improvement. The new GlobalRequestLoadBalancer correctly implements sticky sessions and least-loaded routing with round-robin tie-breaking. The integration into AsyncLLMServerManager and the worker classes is clean, using a try...finally block to ensure resources are properly released, which is critical for robustness. The accompanying unit tests for the load balancer are comprehensive and cover routing logic, sticky sessions, and error conditions effectively. The changes are consistently applied across both the base and the fully asynchronous agent loops. I have not identified any critical or high-severity issues in this implementation.
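To make the wiring concrete, here is a hedged sketch of how the shared actor might be created and handed to workers. The parameter names follow the signatures quoted in this thread (load_balancer_handle, server_handles); the .options(num_cpus=0) placement and the worker count are assumptions, not the PR's actual setup code:

```python
# One shared load-balancer actor; every AgentLoopWorker receives the same
# handle, so routing decisions have a single source of truth across workers.
lb = GlobalRequestLoadBalancer.options(num_cpus=0).remote(num_servers=len(server_handles))

workers = [
    AgentLoopWorker.remote(
        config=config,
        server_handles=server_handles,
        load_balancer_handle=lb,  # shared global LB handle, as required above
    )
    for _ in range(num_workers)
]
```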

```python
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx

    def acquire_server(self, request_id: str) -> int:
```

Collaborator

Why not just return server_actor_id? It's tricky to ensure GlobalRequestLoadBalancer and AsyncLLMServerManager have a consistent order.

```python
def _select_least_loaded_server_idx(self) -> int:
    min_inflight = min(self._inflight_requests)
    num_servers = len(self._inflight_requests)
    for offset in range(num_servers):
```

Collaborator

I'm a bit confused here: why not just choose the server_actor_id with min_inflight?

wuxibin89 previously approved these changes on Mar 6, 2026
feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity

Replace per-worker heapq load balancing with a shared GlobalRequestLoadBalancer
Ray actor that coordinates in-flight request counts across all workers.

Key changes:
- Add GlobalRequestLoadBalancer actor with acquire/release server protocol
- Merge server_handles + server_addresses into servers: list[tuple[str, ActorHandle]]
- Add GRPO prefix cache affinity via request_id-based sticky routing
- Update AsyncLLMServerManager, AgentLoopWorker, FullyAsyncAgentLoopWorker signatures
- Update test_special_server_adapter.py for new constructor API
- Add load balancer unit tests in test_basic_agent_loop.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aoshen524 force-pushed the feat/global-load-balancer branch from 9f8de25 to 06b9c1b on March 6, 2026 at 13:10
wuxibin89 merged commit c37d4d5 into verl-project:main on Mar 9, 2026
95 of 143 checks passed
guillemgt pushed a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
DearFishi pushed a commit to KunlunxinAD/verl that referenced this pull request Mar 20, 2026
sijyang pushed a commit to sijyang/verl that referenced this pull request Apr 1, 2026
DaizeDong pushed a commit to DaizeDong/verl that referenced this pull request Apr 19, 2026