
[rollout] feat: global request-level load balancer single source routing #5399

Merged
wuxibin89 merged 1 commit into verl-project:main from aoshen524:feat/global-load-balancer on Mar 9, 2026


Conversation

Contributor

@aoshen524 commented Feb 25, 2026

Summary

This PR is Task 1 (Phase 1) of the routing roadmap (#5442):

  • Introduce a global GlobalRequestLoadBalancer as the runtime single source of truth for routing.
  • Keep request-level sticky routing (request_id -> server) plus least-loaded routing (see the sketch below).
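For orientation, here is a minimal sketch of that routing core, reconstructed from the code excerpts quoted in the review below (constructor signature, acquire/release protocol, least-loaded selection with round-robin tie-breaking). The cache-eviction policy and error types are assumptions, and details may differ from the merged implementation:

```python
import ray

DEFAULT_ROUTING_CACHE_SIZE = 10000  # assumed default, per the docstring's "Defaults to 10000"


@ray.remote
class GlobalRequestLoadBalancer:
    """Global sticky-session + in-flight load balancer shared by all AgentLoopWorkers."""

    def __init__(self, num_servers: int, max_cache_size: int = DEFAULT_ROUTING_CACHE_SIZE):
        assert num_servers > 0
        self._inflight_requests = [0] * num_servers
        self._request_to_server: dict[str, int] = {}  # sticky request_id -> server_idx
        self._max_cache_size = max_cache_size
        self._next_server_idx = 0  # round-robin pointer for tie-breaking

    def acquire_server(self, request_id: str) -> int:
        # Sticky routing: a request_id seen before goes back to the same server
        # (prefix cache affinity for multi-turn conversations).
        server_idx = self._request_to_server.get(request_id)
        if server_idx is None:
            server_idx = self._select_least_loaded_server_idx()
            if len(self._request_to_server) < self._max_cache_size:
                self._request_to_server[request_id] = server_idx
        self._inflight_requests[server_idx] += 1
        return server_idx

    def release_server(self, server_idx: int) -> None:
        if not 0 <= server_idx < len(self._inflight_requests):
            raise ValueError(f"Invalid server_idx for release: {server_idx}")
        if self._inflight_requests[server_idx] <= 0:
            raise RuntimeError(f"Release called with no inflight requests on server {server_idx}")
        self._inflight_requests[server_idx] -= 1

    def _select_least_loaded_server_idx(self) -> int:
        # Least-loaded selection with round-robin tie-breaking among equals.
        min_inflight = min(self._inflight_requests)
        num_servers = len(self._inflight_requests)
        for offset in range(num_servers):
            candidate_idx = (self._next_server_idx + offset) % num_servers
            if self._inflight_requests[candidate_idx] == min_inflight:
                self._next_server_idx = (candidate_idx + 1) % num_servers
                return candidate_idx
        raise RuntimeError("Failed to select a server for routing.")
```

Every AgentLoopWorker shares one handle to this actor, so acquire_server/release_server counts stay globally consistent across nodes.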

Code Scope

  • verl/experimental/agent_loop/agent_loop.py
    • global LB routing path (acquire/release)
  • verl/experimental/fully_async_policy/agent_loop/agent_loop.py
    • keep required wiring only
  • tests/experimental/agent_loop/test_basic_agent_loop.py
    • load balancer behavior tests aligned with current strategy

Experiment Configuration (Copied)

Model/Data:

  • Model: Qwen2.5-VL-32B
  • Data: osworld.json
  • Machine: 3 * H100 Nodes

Runtime config (from run):

```bash
MAX_ASSISTANT_TURNS=50
MAX_PROMPT_LENGTH=24488
RESPONSE_LENGTH=512
MAX_TOKENS=512
NNODES=3
N_GPU_PER_NODE=8
HARDWARE=3 x H100 nodes
SP_SIZE=1
MICRO_BATCH_SIZE_PER_GPU=1
PPO_MINI_BATCH_SIZE=6
TRAIN_BATCH_SIZE=24
ROLLOUT_N=16
STRATEGY=fsdp2
ENGINE=vllm
DUMP_VALIDATION_DATA=true
DUMP_ROLLOUT_DATA=false
DUMP_TASK_SCORES=false
DUMP_ROLLOUT_TIMING=true
FILTER_GROUPS=true
DETERMINISTIC=false
TRAJECTORY_SPLITTING=true
USE_FLEX_ATTENTION_HETERO=false
DUMMY_MODE=disabled
RAY_DASHBOARD_ADDRESS=http://127.0.0.1:8265
nsys_prof=false
ENABLE_PROMETHEUS_MONITORING=true
PROMETHEUS_PORT=9090
SCRIPT_DIR=/home/ubuntu/verl-grounding
PROMETHEUS_CONFIG_FILE=/home/ubuntu/verl-grounding/gui_scripts/vllm_monitoring/prometheus.yaml
PROMETHEUS_SERVED_MODEL_NAME=qwen2_5_vl_32b
EXPERIMENT_NAME=gui-agent-fsdp2-gb24-3nodes-20260223_145012
HF_MODEL_PATH=/mnt/data2/models_and_datasets/sft-32b-bigs1-1027-s2-0103-osworld
TRAIN_DATA_PATH=data/nongoogledrive_02_04_no_infeasible_and_chart_editing_train.json
TEST_DATA_PATH=data/nongoogledrive_02_04_no_infeasible_and_chart_editing_test.json
PROJECT_NAME=verl_grounding
SAVE_PATH=/mnt/data2/saves/verl_grounding/gui-agent-fsdp2-gb24-3nodes-20260223_145012
ALL_OFFLOAD=true
ACTOR_PARAM_OFFLOAD=true
ACTOR_GRAD_OFFLOAD=true
ACTOR_OPTIMIZER_OFFLOAD=true
REF_PARAM_OFFLOAD=true
PROMETHEUS_PARAMS='actor_rollout_ref.rollout.disable_log_stats=False actor_rollout_ref.rollout.prometheus.enable=True actor_rollout_ref.rollout.prometheus.port=9090 actor_rollout_ref.rollout.prometheus.file=/home/ubuntu/verl-grounding/gui_scripts/vllm_monitoring/prometheus.yaml actor_rollout_ref.rollout.prometheus.served_model_name=qwen2_5_vl_32b'
```

Load Distribution Comparison (Placeholder)

Before (baseline):
<img width="1121" height="474" alt="148a35c49293e9a79fe25d2877d2670f" src="https://github.com/user-attachments/assets/771d4e09-77df-4a7e-b071-fe7fc570c2a4" />

After (optimized):
<img width="1152" height="491" alt="d516d2756af159a4b2b86d9228e8ac38" src="https://github.com/user-attachments/assets/7b6a1f8d-18d9-494a-9243-42edea6386a4" />

Metrics (Before PR -> After PR)

| Metric | Baseline | Optimized | Delta |
|---|---:|---:|---:|
| `agent_loop/per_step/generate_sequences/mean` | 6.3596 s | 4.5312 s | -1.8285 s (-28.75%) |
| `agent_loop/slowest/total_time` | 899.3761 s | 724.6898 s | -174.6863 s (-19.42%) |

Contributor

@gemini-code-assist (bot) left a comment

Code Review

The pull request successfully replaces the heapq-based local load balancing with a centralized GlobalRequestLoadBalancer Ray actor, which is a significant improvement for multi-node, multi-worker setups. The introduction of try/finally blocks for acquire_server and release_server ensures proper resource management and prevents counter leaks, enhancing the robustness of the system. However, there are a couple of areas that need attention to ensure correctness and maintainability.
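To make that pattern concrete, here is a hedged sketch of the acquire -> try/finally -> release flow. The try/finally shape and the helper names mirror the diff excerpts later in this thread; the generation call itself is a stand-in:

```python
from typing import Any


async def generate_with_routing(manager: Any, request_id: str, **sampling_params):
    # Acquire a server index from the load balancer (global actor or local fallback);
    # the in-flight counter for that server is incremented as part of acquisition.
    server_idx, server = await manager._acquire_server_with_index(request_id)
    try:
        # Route the request to the chosen server; any exception still triggers release.
        return await server.generate(request_id=request_id, **sampling_params)
    finally:
        # Always decrement the in-flight counter, even on error or cancellation,
        # so the load balancer's counts never leak.
        await manager._release_server(server_idx)
```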

Comment on lines +159 to +160

```python
server_idx, server = self._choose_server_with_index(request_id)
self._inflight_requests[server_idx] += 1
```

Contributor

critical

In AsyncLLMServerManager, when _load_balancer is None (local load-balancing mode), the _inflight_requests list is shared mutable state within a single AsyncLLMServerManager instance. The AgentLoopWorker.generate_sequences method uses asyncio.gather to run multiple agent loops concurrently, and each loop accesses the same self.server_manager instance. This creates a race condition when _acquire_server_with_index and _release_server modify self._inflight_requests concurrently, which can lead to incorrect in-flight request counts and thus suboptimal or incorrect load balancing. To fix this, access to self._inflight_requests in local mode must be protected by an asyncio.Lock.

Suggested change:

```diff
-server_idx, server = self._choose_server_with_index(request_id)
-self._inflight_requests[server_idx] += 1
+async with self._local_lock:
+    server_idx, server = self._choose_server_with_index(request_id)
+    self._inflight_requests[server_idx] += 1
```

Comment on lines +174 to +175

```python
if self._inflight_requests[server_idx] > 0:
    self._inflight_requests[server_idx] -= 1
```

Contributor

critical

Similar to the _acquire_server_with_index method, the _release_server method in local load-balancing mode also modifies self._inflight_requests, which is shared mutable state. This operation needs to be protected by an asyncio.Lock to prevent race conditions and ensure the integrity of the in-flight request counts.

Suggested change:

```diff
-if self._inflight_requests[server_idx] > 0:
-    self._inflight_requests[server_idx] -= 1
+async with self._local_lock:
+    if self._inflight_requests[server_idx] > 0:
+        self._inflight_requests[server_idx] -= 1
```

Comment on lines +117 to +121

```python
self._load_balancer = load_balancer_handle

# In global load-balancer mode, server index must map to the same instance on every worker.
if self._load_balancer is None:
    random.shuffle(self.server_handles)
```

Contributor

high

To ensure that the _local_lock is always available when _load_balancer is None, it should be initialized unconditionally in the __init__ method. This prevents a potential AttributeError if _local_lock is accessed before _load_balancer is determined to be None.

Suggested change:

```diff
+self.server_handles = list(server_handles)
 self._load_balancer = load_balancer_handle
+self._local_lock = asyncio.Lock()  # Initialize lock unconditionally

 # In global load-balancer mode, server index must map to the same instance on every worker.
 if self._load_balancer is None:
     random.shuffle(self.server_handles)
```

```python
        if self._inflight_requests[candidate_idx] == min_inflight:
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx
    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

The RuntimeError at this line appears to be unreachable. If num_servers is positive (which is checked in __init__), the loop for offset in range(num_servers) will always find a server with the minimum in-flight requests. This makes the RuntimeError dead code. Consider replacing it with an assertion to indicate an impossible state, which is more appropriate for logic errors that should not occur.

Suggested change:

```diff
-raise RuntimeError("Failed to select a server for routing.")
+assert False, "Logic error: should always find a server if num_servers > 0"
```

```python
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx

    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

Similar to the GlobalRequestLoadBalancer's _select_least_loaded_server_idx method, this RuntimeError is likely unreachable. The preceding check if not self._inflight_requests: handles the empty case. If _inflight_requests is not empty, the loop is guaranteed to find a server. Replacing this with an assertion would clarify that this is an unexpected, impossible state.

Suggested change:

```diff
-raise RuntimeError("Failed to select a server for routing.")
+assert False, "Logic error: should always find a server if num_servers > 0"
```

aoshen524 force-pushed the feat/global-load-balancer branch from b3a9533 to 8794343 on February 25, 2026 at 09:36
aoshen524 changed the title from "feat(agent_loop): replace heapq load balancing with global in-flight request balancer" to "feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity" on Feb 25, 2026
aoshen524 force-pushed the feat/global-load-balancer branch 3 times, most recently from 5f0a98c to 9bbcd96, on February 25, 2026 at 11:29
aoshen524 changed the title from "feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity" to "feat(agent_loop): global request-level load balancer single source routing" on Feb 28, 2026
```python
class GlobalRequestLoadBalancer:
    """Global sticky-session + in-flight load balancer shared by all AgentLoopWorkers."""

    def __init__(self, num_servers: int, max_cache_size: int = DEFAULT_ROUTING_CACHE_SIZE):
```

Collaborator

Better to pass in server actor ids here, since we may support elastic scale-up/down of servers in the future.

Contributor Author

sure

Contributor Author

@aoshen524 commented Mar 2, 2026

Should we do A or B? I think A is better here since the IP is shown.

A: self.server_addresses

AgentLoopManager: ['10.1.201.2:35057', '10.1.201.2:45275', '10.1.201.2:38121', '10.1.201.2:46807']

B: the Ray attribute _actor_id.hex()

['17e5a28bb23117714906f67323000000', '811580cb4c81c7fcb617c74623000000', 'a3ce9c328fb401163cdbb7d023000000', 'bad2866fe8059f1346b6442e23000000']
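For context, a hedged illustration of how each identifier could be obtained. The address getter is hypothetical; _actor_id.hex() is the Ray attribute named in option B:

```python
import ray


def describe_servers(server_handles: list[ray.actor.ActorHandle]) -> None:
    # Option A: human-readable HTTP addresses (IP:port visible in logs).
    # `get_server_address` is a hypothetical remote getter on the server actor.
    addresses = ray.get([h.get_server_address.remote() for h in server_handles])
    # Option B: opaque per-actor hex ids, via Ray's internal _actor_id attribute.
    actor_ids = [h._actor_id.hex() for h in server_handles]
    print("A:", addresses)
    print("B:", actor_ids)
```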

@wuxibin89
Collaborator

/gemini review

wuxibin89 changed the title from "feat(agent_loop): global request-level load balancer single source routing" to "[rollout] feat: global request-level load balancer single source routing" on Mar 2, 2026
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global GlobalRequestLoadBalancer to centralize routing logic, which is a significant architectural improvement over the previous per-worker approach. The implementation correctly handles sticky sessions and least-loaded routing with round-robin tie-breaking, and the new unit tests are comprehensive. I have one high-severity suggestion to improve the reliability of releasing server resources, which will prevent potential state corruption in the load balancer and enhance long-term system stability.

Comment on lines +138 to +143

```python
async def _release_server(self, server_idx: Optional[int]) -> None:
    if server_idx is None:
        return
    # Fire-and-forget: release is just a counter decrement, no need to await.
    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
    self._load_balancer.release_server.remote(server_idx=server_idx)
```

Contributor

high

The current fire-and-forget approach for releasing a server can lead to state corruption in the load balancer if the release_server call is lost (e.g., because the worker process crashes before the message is sent). This would cause the in-flight request count to "leak", eventually making the load balancer believe servers are busier than they are, which could degrade or halt the system.

While the comment correctly identifies the risk of blocking in a finally clause, an unresponsive load balancer would likely cause subsequent acquire_server calls to block anyway. Prioritizing the consistency of the load balancer's state seems more crucial for long-term system stability.

I suggest awaiting the release_server call within a try...except block to handle potential Ray errors gracefully. This ensures the release is confirmed while preventing the worker from getting stuck indefinitely if the load balancer actor is down.

Suggested change:

```diff
 async def _release_server(self, server_idx: Optional[int]) -> None:
     if server_idx is None:
         return
-    # Fire-and-forget: release is just a counter decrement, no need to await.
-    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
-    self._load_balancer.release_server.remote(server_idx=server_idx)
+    try:
+        await self._load_balancer.release_server.remote(server_idx=server_idx)
+    except ray.exceptions.RayError as e:
+        logger.warning(f"Failed to release server {server_idx} from load balancer: {e}")
```

(Comment thread on verl/experimental/agent_loop/agent_loop.py)
@wuxibin89
Collaborator

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, GlobalRequestLoadBalancer, to manage routing for AgentLoopWorkers. It refactors the load balancing logic from AsyncLLMServerManager into this new Ray actor, aiming to provide a single source of truth for routing and improve load distribution. A security audit confirmed that no high or critical security vulnerabilities were identified, and the implementation adheres to secure coding principles. The AI model's code review comments have been reviewed and are included as they provide valuable feedback on testing practices and potential logical issues.

Comment on lines +499 to +500

```python
with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release"):
    ray.get(lb.release_server.remote(server_idx=9))
```

Contributor

high

The match argument in pytest.raises uses a very specific string. If the error message in GlobalRequestLoadBalancer.release_server changes slightly (e.g., adds punctuation or more context), this test will fail even if the underlying logic remains correct. It is generally safer to use a more general regex, or to check for the exception type and a key substring; better still, match on the specific exception type and then assert on the message content separately if more precision is needed.

Suggested change:

```diff
-with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release"):
+with pytest.raises(ray.exceptions.RayTaskError, match="Invalid server_idx for release") as excinfo:
     ray.get(lb.release_server.remote(server_idx=9))
+assert "Invalid server_idx for release" in str(excinfo.value)
```

Comment on lines +504 to +505

```python
with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests"):
    ray.get(lb.release_server.remote(server_idx=1))
```

Contributor

high

Similar to the previous comment, the match argument for pytest.raises is very specific. The error message in GlobalRequestLoadBalancer.release_server is "Release called with no inflight requests on server {server_idx}". A minor change to this message could cause the test to break. Consider using a more robust matching strategy.

Suggested change:

```diff
-with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests"):
+with pytest.raises(ray.exceptions.RayTaskError, match="no inflight requests") as excinfo:
     ray.get(lb.release_server.remote(server_idx=1))
+assert "no inflight requests" in str(excinfo.value)
```

```python
        if self._inflight_requests[candidate_idx] == min_inflight:
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx
    raise RuntimeError("Failed to select a server for routing.")
```

Contributor

high

The RuntimeError("Failed to select a server for routing.") on line 78 seems potentially unreachable, or indicates a fundamental flaw if it can be reached. Given that server_actor_ids is checked for emptiness in __init__, num_servers will always be greater than zero. The loop for offset in range(num_servers) guarantees that candidate_idx will iterate through all possible server indices. Since min_inflight is calculated from self._inflight_requests, at least one server will always have _inflight_requests[candidate_idx] == min_inflight. This suggests the RuntimeError might never be hit; if it is, it points to an unexpected state that needs further investigation or a more specific error condition.

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request load balancer to manage server routing, which is a great step towards a more robust and scalable architecture. The implementation of GlobalRequestLoadBalancer with sticky sessions and least-loaded routing with round-robin tie-breaking is solid, and the accompanying tests are comprehensive.

My main concern, which I've flagged as critical, is that the load_balancer_handle is treated as optional in several __init__ methods. However, the new implementation completely relies on this handle, and providing None will lead to runtime errors as there's no fallback to the old load balancing logic. I've suggested making this handle a required parameter to prevent such failures.

Comment on lines +121 to +128

```python
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
):
    """Initialize the AsyncLLMServerManager.

    Args:
        config (DictConfig): whole config for main entrypoint.
        server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
        max_cache_size (int, optional): max cache size for request_id to server mapping. Defaults to 10000.
        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
```

Contributor

critical

The load_balancer_handle is marked as Optional, but if it is None, calls to _acquire_server_with_index and _release_server will fail with an AttributeError because self._load_balancer will be None. The old load balancing logic has been removed, so there is no fallback. The load_balancer_handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
 ):
     """Initialize the AsyncLLMServerManager.

     Args:
         config (DictConfig): whole config for main entrypoint.
         server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
-        max_cache_size (int, optional): max cache size for request_id to server mapping. Defaults to 10000.
-        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
+        load_balancer_handle (ray.actor.ActorHandle): shared global load balancer actor.
```

Comment on lines +416 to +424

```python
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
):
    """Initialize agent loop manager.

    Args:
        config (DictConfig): YAML config.
        server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
        reward_loop_worker_handles (List[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
    """
```

Contributor

critical

The load_balancer_handle is marked as Optional, but it is passed to AsyncLLMServerManager, which requires it. If load_balancer_handle is None, this will lead to a runtime AttributeError since the internal load balancing logic has been removed. The handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
 ):
     """Initialize agent loop manager.

     Args:
         config (DictConfig): YAML config.
         server_handles (List[ray.actor.ActorHandle]): OpenAI compatible LLM server actor handles.
         reward_loop_worker_handles (List[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
-        load_balancer_handle (Optional[ray.actor.ActorHandle]): shared global load balancer actor.
+        load_balancer_handle (ray.actor.ActorHandle): shared global load balancer actor.
     """
```

```python
    config: DictConfig,
    server_handles: list[ray.actor.ActorHandle],
    reward_loop_worker_handles: list[ray.actor.ActorHandle] = None,
    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
```

Contributor

critical

The load_balancer_handle is marked as Optional, but it is passed to FullyAsyncLLMServerManager and the parent AgentLoopWorker, both of which require it. If it is None, this will cause a runtime AttributeError. The handle should be made non-optional.

Suggested change:

```diff
-    load_balancer_handle: Optional[ray.actor.ActorHandle] = None,
+    load_balancer_handle: ray.actor.ActorHandle,
```

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, GlobalRequestLoadBalancer, as a centralized Ray actor to manage routing for LLM servers. The implementation correctly handles sticky sessions for multi-turn conversations and falls back to a least-loaded strategy with round-robin tie-breaking for new requests. The new load balancer is integrated into AsyncLLMServerManager and wired through the AgentLoopManager and AgentLoopWorker components. Unit tests for the new load balancer are also included and cover various routing and session stickiness scenarios.

My main feedback concerns the implementation of _release_server in AsyncLLMServerManager. It is unnecessarily declared as an async method, which can be confusing. I've suggested changing it to a synchronous method for better code clarity, which also requires updating the call sites.

Comment on lines +140 to +145

```python
async def _release_server(self, server_idx: Optional[int]) -> None:
    if server_idx is None:
        return
    # Fire-and-forget: release is just a counter decrement, no need to await.
    # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
    self._load_balancer.release_server.remote(server_idx=server_idx)
```

Contributor

high

This method is declared as async but contains no await expressions. This is misleading because it suggests an asynchronous operation that can be awaited, but it completes synchronously. To improve clarity and align with its fire-and-forget nature, it should be a regular synchronous method. This is particularly important as it is used in a finally block.

Suggested change:

```diff
-async def _release_server(self, server_idx: Optional[int]) -> None:
+def _release_server(self, server_idx: Optional[int]) -> None:
     if server_idx is None:
         return
     # Fire-and-forget: release is just a counter decrement, no need to await.
     # Awaiting here risks blocking the finally clause if the LB actor is unresponsive.
     self._load_balancer.release_server.remote(server_idx=server_idx)
```

```python
        )
        return output
    finally:
        await self._release_server(server_idx)
```

Contributor

high

Following the recommended change of _release_server to a synchronous method, the await keyword should be removed here to maintain consistency and correctness.

Suggested change:

```diff
-await self._release_server(server_idx)
+self._release_server(server_idx)
```

@aoshen524
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a global request-level load balancer, which is a significant and well-executed architectural improvement. The new GlobalRequestLoadBalancer correctly implements sticky sessions and least-loaded routing with round-robin tie-breaking. The integration into AsyncLLMServerManager and the worker classes is clean, using a try...finally block to ensure resources are properly released, which is critical for robustness. The accompanying unit tests for the load balancer are comprehensive and cover routing logic, sticky sessions, and error conditions effectively. The changes are consistently applied across both the base and the fully asynchronous agent loops. I have not identified any critical or high-severity issues in this implementation.
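To make the wiring concrete, here is a hedged sketch of how the shared actor might be created and handed to workers. The parameter names follow the signatures quoted in this thread (load_balancer_handle, server_handles); the .options(num_cpus=0) placement and the worker count are assumptions, not the PR's actual setup code:

```python
# One shared load-balancer actor; every AgentLoopWorker receives the same
# handle, so routing decisions have a single source of truth across workers.
lb = GlobalRequestLoadBalancer.options(num_cpus=0).remote(num_servers=len(server_handles))

workers = [
    AgentLoopWorker.remote(
        config=config,
        server_handles=server_handles,
        load_balancer_handle=lb,  # shared global LB handle, as required above
    )
    for _ in range(num_workers)
]
```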

```python
            self._next_server_idx = (candidate_idx + 1) % num_servers
            return candidate_idx

    def acquire_server(self, request_id: str) -> int:
```

Collaborator

Why not just return server_actor_id? It's tricky to ensure GlobalRequestLoadBalancer and AsyncLLMServerManager have a consistent order.

```python
def _select_least_loaded_server_idx(self) -> int:
    min_inflight = min(self._inflight_requests)
    num_servers = len(self._inflight_requests)
    for offset in range(num_servers):
```

Collaborator

I'm a bit confused here: why not just choose the server_actor_id with min_inflight?

wuxibin89 previously approved these changes on Mar 6, 2026
feat(agent_loop): global in-flight load balancer with GRPO prefix cache affinity

Replace per-worker heapq load balancing with a shared GlobalRequestLoadBalancer
Ray actor that coordinates in-flight request counts across all workers.

Key changes:
- Add GlobalRequestLoadBalancer actor with acquire/release server protocol
- Merge server_handles + server_addresses into servers: list[tuple[str, ActorHandle]]
- Add GRPO prefix cache affinity via request_id-based sticky routing
- Update AsyncLLMServerManager, AgentLoopWorker, FullyAsyncAgentLoopWorker signatures
- Update test_special_server_adapter.py for new constructor API
- Add load balancer unit tests in test_basic_agent_loop.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aoshen524 force-pushed the feat/global-load-balancer branch from 9f8de25 to 06b9c1b on March 6, 2026 at 13:10
wuxibin89 merged commit c37d4d5 into verl-project:main on Mar 9, 2026
95 of 143 checks passed
guillemgt pushed a commit to guillemgt/verl that referenced this pull request Mar 9, 2026
DearFishi pushed a commit to KunlunxinAD/verl that referenced this pull request Mar 20, 2026
sijyang pushed a commit to sijyang/verl that referenced this pull request Apr 1, 2026
DaizeDong pushed a commit to DaizeDong/verl that referenced this pull request Apr 19, 2026