[Serve] Optimize replica routing request data structures by abrarsheikh · Pull Request #60139 · ray-project/ray

abrarsheikh · 2026-01-14T16:36:05Z

O(1) Pending Request Lookups
- Added dict indices (_pending_requests_by_id and _pending_requests_by_model_id) for fast lookups
- Replaced O(n) linear scans with O(1) dict lookups when finding requests by ID or multiplexed model
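
A minimal sketch of the indexing idea, not the actual Ray Serve code. The `PendingRequest` and `PendingIndex` classes and their field names are hypothetical, and `concurrent.futures.Future` stands in for whatever future type the router uses. The dicts are kept next to the FIFO queue, so finding a request by ID is an average O(1) lookup rather than a scan.

```python
from collections import defaultdict, deque
from concurrent.futures import Future
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional


@dataclass
class PendingRequest:
    """Hypothetical stand-in for the router's pending-request record."""

    request_id: str
    model_id: str = ""
    future: Future = field(default_factory=Future)


class PendingIndex:
    """Keeps two dict indices alongside the FIFO queue of pending requests."""

    def __init__(self) -> None:
        self.queue: Deque[PendingRequest] = deque()
        self.by_id: Dict[str, PendingRequest] = {}
        self.by_model_id: Dict[str, Deque[PendingRequest]] = defaultdict(deque)

    def add(self, req: PendingRequest) -> None:
        # Every structure is updated together so the indices stay consistent.
        self.queue.append(req)
        self.by_id[req.request_id] = req
        if req.model_id:
            self.by_model_id[req.model_id].append(req)

    def find_by_id(self, request_id: str) -> Optional[PendingRequest]:
        # Average O(1) dict lookup instead of scanning the whole deque.
        return self.by_id.get(request_id)


index = PendingIndex()
index.add(PendingRequest("r1", model_id="m1"))
index.add(PendingRequest("r2"))
found = index.find_by_id("r2")
missing = index.find_by_id("unknown")
```

The trade-off is extra bookkeeping on every insert and removal in exchange for cheap lookups on the routing hot path.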
Cached Replica List
- Added _replicas_list cache to avoid O(n) dict-to-list conversion on every routing iteration
- List updated only when replicas change via update_replicas() or on_replica_actor_died()
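
The caching pattern can be sketched as follows. This is an illustration under assumptions: the `ReplicaCache` class and the use of plain strings as replica handles are hypothetical, and only the method names `update_replicas()` and `on_replica_actor_died()` and the attribute `_replicas_list` come from the description above.

```python
from typing import Dict, List


class ReplicaCache:
    """Rebuilds the list view only when the replica set changes."""

    def __init__(self) -> None:
        self._replicas: Dict[str, str] = {}
        self._replicas_list: List[str] = []

    def update_replicas(self, replicas: Dict[str, str]) -> None:
        self._replicas = dict(replicas)
        # O(n) conversion happens here, once per membership change.
        self._replicas_list = list(self._replicas.values())

    def on_replica_actor_died(self, replica_id: str) -> None:
        if replica_id in self._replicas:
            del self._replicas[replica_id]
            self._replicas_list = list(self._replicas.values())

    def candidates(self) -> List[str]:
        # Hot path: returns the cached list without copying.
        return self._replicas_list


cache = ReplicaCache()
cache.update_replicas({"a": "replica-A", "b": "replica-B"})
same_object = cache.candidates() is cache.candidates()
cache.on_replica_actor_died("a")
remaining = cache.candidates()
```

Because callers share the cached list, they should treat it as read-only.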
Lazy Cleanup Strategy
- Done futures are lazily cleaned from _pending_requests_by_model_id during lookups using O(1) popleft()
- Avoids expensive O(n) removal from deques
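
A rough sketch of lazy cleanup, assuming each per-model deque holds futures in arrival order. The function name `next_pending_for_model` is hypothetical, `concurrent.futures.Future` is a stand-in for the router's future type, and deleting the empty deque is an assumption of this sketch rather than something the description states.

```python
from collections import deque
from concurrent.futures import Future
from typing import Deque, Dict, Optional


def next_pending_for_model(
    by_model_id: Dict[str, Deque[Future]], model_id: str
) -> Optional[Future]:
    """Return the oldest unfinished future for a model, discarding done ones."""
    queue = by_model_id.get(model_id)
    if queue is None:
        return None
    # Completed futures are dropped only when a lookup reaches them;
    # each popleft() is O(1), unlike removing from the middle of a deque.
    while queue and queue[0].done():
        queue.popleft()
    if not queue:
        # Assumption: drop the empty deque so the index does not grow forever.
        del by_model_id[model_id]
        return None
    return queue[0]


finished, waiting = Future(), Future()
finished.set_result("ok")
pending_index = {"m1": deque([finished, waiting])}
head = next_pending_for_model(pending_index, "m1")
```

Cleanup cost is paid incrementally by lookups, so no single operation has to walk the deque to remove one entry.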
Optimized Retry Insertion
- Extracted sorted insertion logic into _insert_pending_request_sorted() helper
- O(1) fast path for common case (recent retries append to end)
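
One way such a helper could look, as a sketch only. The description does not say which key the queue is sorted by, so the `created_at` timestamp, the `Pending` class, and the function name `insert_sorted` are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Pending:
    """Hypothetical pending request ordered by creation time."""

    request_id: str
    created_at: float


def insert_sorted(queue: Deque[Pending], req: Pending) -> None:
    """Insert req so the queue stays ordered by created_at."""
    # Fast path: the request is at least as new as the tail, so append in O(1).
    if not queue or queue[-1].created_at <= req.created_at:
        queue.append(req)
        return
    # Slow path: find the first strictly newer entry and insert before it.
    # The tail is newer than req here, so the loop always finds a position.
    for i, existing in enumerate(queue):
        if existing.created_at > req.created_at:
            queue.insert(i, req)
            return


q: Deque[Pending] = deque()
for rid, ts in [("a", 1.0), ("b", 3.0), ("c", 2.0), ("d", 0.5)]:
    insert_sorted(q, Pending(rid, ts))
order = [p.request_id for p in q]
```

The slow path is still linear, but it is only taken when a retried request is older than the current tail.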
Simplified pow_2_router
- Removed redundant dict creation per routing call
- Direct lookup via self._replicas[chosen_id] instead of building temporary map
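
The simplification can be illustrated with a small sketch. The `Replica` and `TinyPow2Router` classes and the `queue_lens` argument are hypothetical; only the `self._replicas[chosen_id]` lookup is taken from the description above.

```python
from typing import Dict, Tuple


class Replica:
    """Hypothetical replica handle; only the ID matters for this sketch."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id


class TinyPow2Router:
    def __init__(self, replicas: Dict[str, Replica]) -> None:
        self._replicas = replicas

    def choose(
        self, candidate_ids: Tuple[str, str], queue_lens: Dict[str, int]
    ) -> Replica:
        first, second = candidate_ids
        # Power-of-two-choices: keep the candidate with the shorter queue.
        chosen_id = first if queue_lens[first] <= queue_lens[second] else second
        # Direct lookup in the long-lived dict; no temporary map per call.
        return self._replicas[chosen_id]


router = TinyPow2Router({"a": Replica("a"), "b": Replica("b")})
chosen = router.choose(("a", "b"), {"a": 3, "b": 1})
```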

random.sample → Direct Selection
Lazy Hash Caching (common.py)
Metrics Throttling (request_router.py, constants.py)
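
For the first item, a hedged sketch of what replacing `random.sample` with direct selection might look like. The PR text gives no details, so this is one common technique and not necessarily the one used: draw one index, then draw a second from the remaining positions and shift it past the first. The function name `choose_two` is hypothetical, and `randrange` is used here even though a commit title in this PR mentions random bits.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def choose_two(items: Sequence[T], rng: random.Random) -> List[T]:
    """Pick two distinct items uniformly without calling random.sample."""
    n = len(items)
    if n <= 2:
        return list(items)
    i = rng.randrange(n)
    # Draw from the n - 1 other positions, then skip over index i.
    j = rng.randrange(n - 1)
    if j >= i:
        j += 1
    return [items[i], items[j]]


rng = random.Random(0)
replicas = ["r0", "r1", "r2", "r3", "r4"]
pairs = [choose_two(replicas, rng) for _ in range(1000)]
```

Both indices come from plain integer draws, which avoids the general-purpose machinery that `random.sample` needs for arbitrary sample sizes.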

Flamegraph of the router after all the optimizations

Signed-off-by: abrar <abrar@anyscale.com>

gemini-code-assist

Code Review

This pull request significantly optimizes the replica routing mechanism in Ray Serve by refactoring data structures and lookup logic. The changes introduce dictionary-based indices (_pending_requests_by_id, _pending_requests_by_model_id) for O(1) lookups of pending requests, replacing previous O(N) iterations over deques. Lazy cleanup of completed futures is implemented to prevent memory leaks, and a cached list of replicas (_replicas_list) is maintained to avoid redundant list conversions. These improvements enhance the efficiency of request matching, fulfillment, and replica selection, leading to better performance, especially in high-throughput or multiplexed model scenarios. The code is well-commented, explaining the rationale behind the optimizations.


harshit-anyscale

great improvements, nice work!
left some comments, else LGTM



## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router queue length gauge has a 100ms throttle (`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates when they happen too quickly. When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.

## Related issue number

Fixes flaky test introduced in #59233 after #60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
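
The failure mode described above can be reproduced with a small throttled gauge. This is a sketch under assumptions: the `ThrottledGauge` class and its injectable clock are hypothetical, and the real gauge in Ray Serve may be implemented differently. Only the 100ms throttle value and the 0-then-1 update sequence come from the description.

```python
from typing import Callable, Optional


class ThrottledGauge:
    """Gauge that drops updates arriving within throttle_s of the last one."""

    def __init__(self, throttle_s: float, clock: Callable[[], float]) -> None:
        self._throttle_s = throttle_s
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.value: Optional[int] = None

    def set(self, value: int) -> bool:
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._throttle_s:
            return False  # Update skipped; the old value stays visible.
        self.value = value
        self._last_emit = now
        return True


fake_now = [0.0]  # Mutable cell so the example controls time deterministically.
gauge = ThrottledGauge(throttle_s=0.1, clock=lambda: fake_now[0])

gauge.set(0)  # Replica initialization reports an empty queue.
fake_now[0] = 0.01
accepted_early = gauge.set(1)  # Request arrives 10ms later and is throttled.
value_seen_by_test = gauge.value

fake_now[0] = 0.2
accepted_late = gauge.set(1)  # Outside the throttle window, so it is recorded.
```

A test that reads the gauge right after the second update observes the stale 0, which matches the flakiness reported for `test_router_queue_len_metric`.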


abrarsheikh added 2 commits January 14, 2026 14:38

[Serve] optimize request router

a5f96e7

Signed-off-by: abrar <abrar@anyscale.com>

cache replica list

7031e86

Signed-off-by: abrar <abrar@anyscale.com>

gemini-code-assist bot reviewed Jan 14, 2026


abrarsheikh added the `go` label (add ONLY when ready to merge, run all tests) Jan 14, 2026

bug

1a9c9a3

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh marked this pull request as ready for review January 14, 2026 20:04

abrarsheikh requested a review from a team as a code owner January 14, 2026 20:04

abrarsheikh requested a review from akyang-anyscale January 14, 2026 20:05

throttle metrics for queue len

84ac4ae

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested a review from harshit-anyscale January 14, 2026 22:33

randomize 2 replicas

e9969f6

Signed-off-by: abrar <abrar@anyscale.com>

cursor bot reviewed Jan 15, 2026


pop queue len

a140850

Signed-off-by: abrar <abrar@anyscale.com>

ray-gardener bot added the `serve` label (Ray Serve Related Issue) Jan 15, 2026

abrarsheikh added 2 commits January 15, 2026 02:36

fix test

0f89ca5

Signed-off-by: abrar <abrar@anyscale.com>

fix test

476c2b0

Signed-off-by: abrar <abrar@anyscale.com>

harshit-anyscale reviewed Jan 15, 2026


use randbits

f0456ea

Signed-off-by: abrar <abrar@anyscale.com>

akyang-anyscale approved these changes Jan 16, 2026


simplify code

c952ea7

Signed-off-by: abrar <abrar@anyscale.com>

cursor bot reviewed Jan 16, 2026


dedupe

7d9b35e

Signed-off-by: abrar <abrar@anyscale.com>

harshit-anyscale approved these changes Jan 16, 2026


abrarsheikh merged commit 00c877d into master Jan 16, 2026
6 checks passed

abrarsheikh deleted the opt-routing branch January 16, 2026 18:10

abrarsheikh mentioned this pull request Jan 20, 2026

[Serve] send requests to replica immediately when replicas are full and max_queued = -1 #60306

Closed

eicherseiji mentioned this pull request Jan 20, 2026

[Serve] Fix flaky test_router_queue_len_metric #60333

Merged


abrarsheikh merged 11 commits into master from opt-routing
