[Serve] Optimize replica routing request data structures#60139
abrarsheikh merged 11 commits into master
Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request significantly optimizes the replica routing mechanism in Ray Serve by refactoring data structures and lookup logic. The changes introduce dictionary-based indices (_pending_requests_by_id, _pending_requests_by_model_id) for O(1) lookups of pending requests, replacing previous O(N) iterations over deques. Lazy cleanup of completed futures is implemented to prevent memory leaks, and a cached list of replicas (_replicas_list) is maintained to avoid redundant list conversions. These improvements enhance the efficiency of request matching, fulfillment, and replica selection, leading to better performance, especially in high-throughput or multiplexed model scenarios. The code is well-commented, explaining the rationale behind the optimizations.
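The index-plus-lazy-cleanup idea described in the review can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PendingRequest`, `PendingIndex`, and their methods are hypothetical names, and `concurrent.futures.Future` is used only so the snippet runs without an event loop.

```python
from collections import defaultdict, deque
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class PendingRequest:
    # Hypothetical stand-in for the router's pending-request record.
    request_id: str
    model_id: str
    future: Future


class PendingIndex:
    """Two dict indices over pending requests (illustrative only)."""

    def __init__(self) -> None:
        self._by_id: Dict[str, PendingRequest] = {}
        self._by_model_id: Dict[str, Deque[PendingRequest]] = defaultdict(deque)

    def add(self, req: PendingRequest) -> None:
        self._by_id[req.request_id] = req
        self._by_model_id[req.model_id].append(req)

    def pop_by_id(self, request_id: str) -> Optional[PendingRequest]:
        # O(1) dict lookup instead of scanning a deque. The per-model deque
        # is deliberately left alone here; stale entries are dropped lazily.
        return self._by_id.pop(request_id, None)

    def next_for_model(self, model_id: str) -> Optional[PendingRequest]:
        queue = self._by_model_id.get(model_id)
        if queue is None:
            return None
        # Lazy cleanup: discard completed futures from the head with O(1)
        # popleft() rather than removing them from the middle of the deque.
        while queue and queue[0].future.done():
            queue.popleft()
        if not queue:
            # Drop the empty deque so the index does not grow without bound.
            del self._by_model_id[model_id]
            return None
        return queue[0]
```

The trade-off in this sketch is that completed entries may linger in a per-model deque until the next lookup for that model reaches them, in exchange for never paying an O(n) removal from the middle of a deque.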
harshit-anyscale
left a comment
great improvements, nice work!
left some comments, else LGTM
…#60139)

1. **O(1) Pending Request Lookups**
   - Added dict indices (`_pending_requests_by_id` and `_pending_requests_by_model_id`) for fast lookups
   - Replaced O(n) linear scans with O(1) dict lookups when finding requests by ID or multiplexed model
2. **Cached Replica List**
   - Added `_replicas_list` cache to avoid O(n) dict-to-list conversion on every routing iteration
   - List updated only when replicas change via `update_replicas()` or `on_replica_actor_died()`
3. **Lazy Cleanup Strategy**
   - Done futures are lazily cleaned from `_pending_requests_by_model_id` during lookups using O(1) `popleft()`
   - Avoids expensive O(n) removal from deques
4. **Optimized Retry Insertion**
   - Extracted sorted insertion logic into `_insert_pending_request_sorted()` helper
   - O(1) fast path for common case (recent retries append to end)
5. **Simplified `pow_2_router`**
   - Removed redundant dict creation per routing call
   - Direct lookup via `self._replicas[chosen_id]` instead of building temporary map

<img width="1281" height="790" alt="image" src="https://github.com/user-attachments/assets/065102e7-739e-47ad-b3cc-60651f455611" />

1. **`random.sample` → Direct Selection**
2. **Lazy Hash Caching** (`common.py`)
3. **Metrics Throttling** (`request_router.py`, `constants.py`)

<img width="1246" height="698" alt="image" src="https://github.com/user-attachments/assets/6b21a7e4-0c3e-42b5-80bb-5e20cc8acc61" />

Flamegraph of the router after all the optimizations:

<img width="3015" height="457" alt="image" src="https://github.com/user-attachments/assets/0f8cf8ec-caf6-420d-bc8c-a8fcdca0c56a" />

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
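The retry-insertion fast path mentioned in the commit message could look something like the sketch below. It is not the body of `_insert_pending_request_sorted()`; it assumes the queue is ordered by a creation timestamp, and `Pending` and `insert_pending_request_sorted` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Deque, NamedTuple


class Pending(NamedTuple):
    # Hypothetical record; the ordering key is assumed to be a timestamp.
    created_at: float
    request_id: str


def insert_pending_request_sorted(queue: Deque[Pending], req: Pending) -> None:
    """Insert req so the queue stays ordered by created_at (oldest first)."""
    # Fast path: an empty queue, or a request no older than the current
    # tail, is a plain O(1) append.
    if not queue or queue[-1].created_at <= req.created_at:
        queue.append(req)
        return
    # Slow path: walk back from the tail to the first entry that is not
    # newer than req, then insert right after it. Equal timestamps keep
    # their arrival order.
    index = len(queue)
    while index > 0 and queue[index - 1].created_at > req.created_at:
        index -= 1
    queue.insert(index, req)
```

Scanning from the tail rather than the head reflects the observation in the commit message that retries are usually recent, so the insertion point tends to be near the end of the queue.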
## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router queue length gauge has a 100ms throttle (`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates when they happen too quickly. When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.

## Related issue number

Fixes flaky test introduced in #59233 after #60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
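The race described above can be reproduced with a toy throttled gauge. This is a sketch under assumptions, not Ray Serve's metrics code: `ThrottledGauge` and its `set()` method are hypothetical, the 0.1 s window mirrors the 100ms throttle mentioned in the description, and a fake clock is injected to keep the behavior deterministic.

```python
from typing import Callable, Optional


class ThrottledGauge:
    """Toy gauge that drops updates arriving within throttle_s of the last one."""

    def __init__(self, throttle_s: float, clock: Callable[[], float]) -> None:
        self._throttle_s = throttle_s
        self._clock = clock  # injected so the example needs no real sleeping
        self._last_update: Optional[float] = None
        self.value: Optional[int] = None

    def set(self, value: int) -> bool:
        """Record value unless throttled; return True if it was recorded."""
        now = self._clock()
        if (
            self._last_update is not None
            and now - self._last_update < self._throttle_s
        ):
            return False  # update skipped: the gauge keeps its old value
        self.value = value
        self._last_update = now
        return True


now = [0.0]
gauge = ThrottledGauge(throttle_s=0.1, clock=lambda: now[0])

gauge.set(0)                   # replica initialization reports an empty queue
now[0] = 0.01                  # a request arrives 10 ms later
recorded = gauge.set(1)        # falls inside the throttle window and is dropped
stale_value = gauge.value      # still 0, which is what the flaky test observed

now[0] = 0.2                   # once the window has passed, updates land again
gauge.set(1)
```

In this model a reader polling the gauge right after the first request sees the stale 0 until some later update falls outside the throttle window.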
