[Serve] Optimize replica routing request data structures#60139

Merged
abrarsheikh merged 11 commits into master from opt-routing on Jan 16, 2026
@abrarsheikh abrarsheikh commented Jan 14, 2026

  1. O(1) Pending Request Lookups

    • Added dict indices (_pending_requests_by_id and _pending_requests_by_model_id) for fast lookups
    • Replaced O(n) linear scans with O(1) dict lookups when finding requests by ID or multiplexed model
  2. Cached Replica List

    • Added _replicas_list cache to avoid O(n) dict-to-list conversion on every routing iteration
    • List updated only when replicas change via update_replicas() or on_replica_actor_died()
  3. Lazy Cleanup Strategy

    • Done futures are lazily cleaned from _pending_requests_by_model_id during lookups using O(1) popleft()
    • Avoids expensive O(n) removal from deques
  4. Optimized Retry Insertion

    • Extracted sorted insertion logic into _insert_pending_request_sorted() helper
    • O(1) fast path for common case (recent retries append to end)
  5. Simplified pow_2_router

    • Removed redundant dict creation per routing call
    • Direct lookup via self._replicas[chosen_id] instead of building temporary map
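The indexing and lazy-cleanup ideas in items 1 and 3 can be illustrated with a minimal sketch. This is not the code from the PR: `PendingRequest`, `PendingIndex`, and their methods are hypothetical names, and `concurrent.futures.Future` stands in for whatever future type the router actually uses. Only the two index names come from the description above.

```python
from collections import defaultdict, deque
from concurrent.futures import Future
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional


@dataclass
class PendingRequest:
    # Hypothetical stand-in for the router's pending-request record.
    request_id: str
    model_id: str
    future: Future = field(default_factory=Future)


class PendingIndex:
    """Two dict indices over the same pending requests (illustrative only)."""

    def __init__(self) -> None:
        self.by_id: Dict[str, PendingRequest] = {}
        self.by_model_id: Dict[str, Deque[PendingRequest]] = defaultdict(deque)

    def add(self, req: PendingRequest) -> None:
        self.by_id[req.request_id] = req
        self.by_model_id[req.model_id].append(req)

    def pop_by_id(self, request_id: str) -> Optional[PendingRequest]:
        # O(1) dict lookup. The matching deque entry is deliberately left
        # in place; it is dropped later, once its future is done.
        return self.by_id.pop(request_id, None)

    def next_for_model(self, model_id: str) -> Optional[PendingRequest]:
        queue = self.by_model_id.get(model_id)
        if queue is None:
            return None
        # Lazy cleanup: discard done futures from the front with O(1)
        # popleft() instead of an O(n) removal from the middle.
        while queue and queue[0].future.done():
            queue.popleft()
        if not queue:
            del self.by_model_id[model_id]
            return None
        return queue[0]
```

The trade-off in this sketch is that completed entries may linger in a per-model deque until the next lookup for that model reaches them.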
image
  1. random.sample → Direct Selection
  2. Lazy Hash Caching (common.py)
  3. Metrics Throttling (request_router.py, constants.py)
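The description names the `random.sample` change without showing it. One plausible way to pick two distinct candidates directly, assuming the goal is to avoid the general-purpose sampling overhead for a fixed sample size of two, is sketched below. `pick_two` is a hypothetical helper, not a function from the PR.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_two(items: Sequence[T]) -> List[T]:
    """Return up to two distinct elements chosen uniformly at random."""
    n = len(items)
    if n <= 2:
        # Nothing to choose between: return everything that is there.
        return list(items)
    i = random.randrange(n)
    # Draw the second index from the remaining n - 1 slots, then shift it
    # past i so the two indices can never collide.
    j = random.randrange(n - 1)
    if j >= i:
        j += 1
    return [items[i], items[j]]
```

This needs an indexable sequence, which is one reason a cached list of replicas is convenient for power-of-two-choices routing.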
image

Flamegraph of the router after all the optimizations:
image

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly optimizes the replica routing mechanism in Ray Serve by refactoring data structures and lookup logic. The changes introduce dictionary-based indices (_pending_requests_by_id, _pending_requests_by_model_id) for O(1) lookups of pending requests, replacing previous O(N) iterations over deques. Lazy cleanup of completed futures is implemented to prevent memory leaks, and a cached list of replicas (_replicas_list) is maintained to avoid redundant list conversions. These improvements enhance the efficiency of request matching, fulfillment, and replica selection, leading to better performance, especially in high-throughput or multiplexed model scenarios. The code is well-commented, explaining the rationale behind the optimizations.
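The cached replica list mentioned in this summary can be sketched as follows. The attribute and method names `_replicas`, `_replicas_list`, `update_replicas()`, and `on_replica_actor_died()` come from the PR description; the class name, the signatures, the string values, and the `candidates()` accessor are assumptions made for illustration.

```python
from typing import Dict, List


class ReplicaSet:
    """Keeps a list view of the replica dict, rebuilt only on changes."""

    def __init__(self) -> None:
        self._replicas: Dict[str, str] = {}
        self._replicas_list: List[str] = []

    def update_replicas(self, replicas: Dict[str, str]) -> None:
        self._replicas = dict(replicas)
        # O(n) conversion happens here, once per membership change.
        self._replicas_list = list(self._replicas.values())

    def on_replica_actor_died(self, replica_id: str) -> None:
        if replica_id in self._replicas:
            del self._replicas[replica_id]
            self._replicas_list = list(self._replicas.values())

    def candidates(self) -> List[str]:
        # Hot path: return the cached list without any per-call copying.
        return self._replicas_list
```

Because the hot path returns the cached object itself, callers in this sketch must treat it as read-only.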

@abrarsheikh abrarsheikh added the "go" label (add ONLY when ready to merge, run all tests) Jan 14, 2026
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh marked this pull request as ready for review January 14, 2026 20:04
@abrarsheikh abrarsheikh requested a review from a team as a code owner January 14, 2026 20:04
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@ray-gardener ray-gardener bot added the "serve" label (Ray Serve Related Issue) Jan 15, 2026
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@harshit-anyscale harshit-anyscale left a comment

great improvements, nice work!
left some comments, else LGTM

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh merged commit 00c877d into master Jan 16, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the opt-routing branch January 16, 2026 18:10
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
…#60139)

1. **O(1) Pending Request Lookups**
- Added dict indices (`_pending_requests_by_id` and
`_pending_requests_by_model_id`) for fast lookups
- Replaced O(n) linear scans with O(1) dict lookups when finding
requests by ID or multiplexed model

2. **Cached Replica List**
- Added `_replicas_list` cache to avoid O(n) dict-to-list conversion on
every routing iteration
- List updated only when replicas change via `update_replicas()` or
`on_replica_actor_died()`

3. **Lazy Cleanup Strategy**
- Done futures are lazily cleaned from `_pending_requests_by_model_id`
during lookups using O(1) `popleft()`
   - Avoids expensive O(n) removal from deques

4. **Optimized Retry Insertion**
- Extracted sorted insertion logic into
`_insert_pending_request_sorted()` helper
   - O(1) fast path for common case (recent retries append to end)

5. **Simplified `pow_2_router`**
   - Removed redundant dict creation per routing call
- Direct lookup via `self._replicas[chosen_id]` instead of building
temporary map
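A minimal sketch of the sorted-insertion idea in item 4, assuming pending requests are ordered by a creation timestamp (the commit message does not state the sort key). `Pending` and its `created_at` field are hypothetical, and the function below only borrows the spirit of `_insert_pending_request_sorted()`, not its actual signature.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Pending:
    # Hypothetical record; the real sort key may differ.
    request_id: str
    created_at: float


def insert_pending_request_sorted(queue: Deque[Pending], req: Pending) -> None:
    """Insert req so that queue stays ordered by created_at."""
    # Fast path: the new request is at least as recent as the tail, so a
    # plain O(1) append keeps the order.
    if not queue or queue[-1].created_at <= req.created_at:
        queue.append(req)
        return
    # Slow path: walk back from the tail to find the insertion point.
    index = len(queue)
    for existing in reversed(queue):
        if existing.created_at <= req.created_at:
            break
        index -= 1
    queue.insert(index, req)
```

Scanning from the tail keeps the slow path short when a retried request is only slightly older than the newest entries.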

<img width="1281" height="790" alt="image"
src="https://github.com/user-attachments/assets/065102e7-739e-47ad-b3cc-60651f455611"
/>

1. **`random.sample` → Direct Selection**
2. **Lazy Hash Caching** (`common.py`)
3. **Metrics Throttling** (`request_router.py`, `constants.py`)

<img width="1246" height="698" alt="image"
src="https://github.com/user-attachments/assets/6b21a7e4-0c3e-42b5-80bb-5e20cc8acc61"
/>

flamegraph of the router after all the optimization
<img width="3015" height="457" alt="image"
src="https://github.com/user-attachments/assets/0f8cf8ec-caf6-420d-bc8c-a8fcdca0c56a"
/>

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
aslonnie pushed a commit that referenced this pull request Jan 21, 2026
## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router
queue length gauge has a 100ms throttle
(`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates
when they happen too quickly.

When replica initialization sets the gauge to 0 and a request
immediately updates it to 1, the second update may be throttled, causing
the test to see 0 instead of 1.
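The race described above can be reproduced with a small throttled-gauge sketch. `ThrottledGauge` and its injectable `clock` are hypothetical and only model the "skip updates that arrive too quickly" behavior stated in this message; they are not Ray Serve's metrics API.

```python
import time
from typing import Callable, Optional


class ThrottledGauge:
    """Drops updates that arrive within throttle_s of the last accepted one."""

    def __init__(
        self, throttle_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._throttle_s = throttle_s
        self._clock = clock
        self._last_update: Optional[float] = None
        self.value: Optional[float] = None

    def set(self, value: float) -> bool:
        now = self._clock()
        if (
            self._last_update is not None
            and now - self._last_update < self._throttle_s
        ):
            # Too soon after the previous update: the new value is skipped.
            return False
        self.value = value
        self._last_update = now
        return True
```

With a 0.1 s throttle, setting 0 and then 1 ten milliseconds later leaves the gauge at 0, which matches the stale value the flaky test observed.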

## Related issue number

Fixes flaky test introduced in #59233 after #60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…#60139)

jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
…#60139)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…#60139)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…#60139)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

3 participants