Skip to content

Commit c2ec21e

Browse files
[Serve] Add default autoscaling params for custom policies (#58857)
## Description Currently, custom autoscaling policies bypass all standard autoscaling configuration parameters - they must manually implement delay logic, scaling factors, and bounds checking themselves. This PR adds an `apply_autoscaling_config` decorator that enables custom autoscaling policies to automatically benefit from Ray Serve's standard autoscaling parameters that are embedded in the default policy (`upscale_delay_s,` `downscale_delay_s`, `downscale_to_zero_delay_s`, `upscaling_factor`, `downscaling_factor`, `min_replicas`, `max_replicas`). ## Related issues Fixes #58622 ## Implementation Details - Core implementation (python/ray/serve/autoscaling_policy.py): - Added `apply_autoscaling_config decorator` - Refactored delay logic into `_apply_delay_logic()` helper function - Added scaling factor logic for custom policies`_apply_scaling_factors()` helper function - Refactored bounds checking into` _apply_bounds()` helper function - Updated replica_queue_length_autoscaling_policy to use `_apply_delay_logic` function - Tests (python/ray/serve/tests/test_autoscaling_policy.py and python/ray/tests/unit/test_autoscaling_policy.py) - End-to-end tests verifying delay enforcement for decorated custom policies - Tests for scaling factor moderation (upscaling and downscaling) - Unit tests for checking each helper function Added documentation for usage with example --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Co-authored-by: harshit-anyscale <harshit@anyscale.com>
1 parent 4f1ce10 commit c2ec21e

File tree

10 files changed

+1171
-511
lines changed

10 files changed

+1171
-511
lines changed

ci/lint/pydoclint-baseline.txt

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1609,11 +1609,6 @@ python/ray/serve/api.py
16091609
DOC103: Function `get_deployment_handle`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [_check_exists: bool, _record_telemetry: bool].
16101610
DOC201: Function `get_deployment_handle` does not have a return section in docstring
16111611
--------------------
1612-
python/ray/serve/autoscaling_policy.py
1613-
DOC101: Function `_calculate_desired_num_replicas`: Docstring contains fewer arguments than in function signature.
1614-
DOC111: Function `_calculate_desired_num_replicas`: The option `--arg-type-hints-in-docstring` is `False` but there are type hints in the docstring arg list
1615-
DOC103: Function `_calculate_desired_num_replicas`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [num_running_replicas: int, total_num_requests: int]. Arguments in the docstring but not in the function signature: [current_num_ongoing_requests: List[float]].
1616-
--------------------
16171612
python/ray/serve/batching.py
16181613
DOC111: Method `_BatchQueue.__init__`: The option `--arg-type-hints-in-docstring` is `False` but there are type hints in the docstring arg list
16191614
DOC101: Function `batch`: Docstring contains fewer arguments than in function signature.

doc/source/serve/advanced-guides/advanced-autoscaling.md

Lines changed: 46 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -637,11 +637,41 @@ Policies are defined **per deployment**. If you don’t provide one, Ray Serve f
637637

638638
The policy function is invoked by the Ray Serve controller every `RAY_SERVE_CONTROL_LOOP_INTERVAL_S` seconds (default **0.1s**), so your logic runs against near-real-time state.
639639

640+
Your policy can return an `int` or a `float` for `target_replicas`. If it returns a float, Ray Serve converts it to an integer replica count by rounding up to the next greatest integer.
641+
640642
:::{warning}
641643
Keep policy functions **fast and lightweight**. Slow logic can block the Serve controller and degrade cluster responsiveness.
642644
:::
643645

644646

647+
### Applying standard autoscaling parameters to custom policies
648+
649+
Ray Serve automatically applies the following standard autoscaling parameters from your [`AutoscalingConfig`](../api/doc/ray.serve.config.AutoscalingConfig.rst) to custom policies:
650+
- `upscale_delay_s`, `downscale_delay_s`, `downscale_to_zero_delay_s`
651+
- `upscaling_factor`, `downscaling_factor`
652+
- `min_replicas`, `max_replicas`
653+
654+
The following example shows a custom autoscaling policy with standard autoscaling parameters applied.
655+
656+
```{literalinclude} ../doc_code/autoscaling_policy.py
657+
:language: python
658+
:start-after: __begin_apply_autoscaling_config_example__
659+
:end-before: __end_apply_autoscaling_config_example__
660+
```
661+
662+
```{literalinclude} ../doc_code/autoscaling_policy.py
663+
:language: python
664+
:start-after: __begin_apply_autoscaling_config_usage__
665+
:end-before: __end_apply_autoscaling_config_usage__
666+
```
667+
668+
::::{note}
669+
Your policy function should return the "raw" desired number of replicas. Ray Serve applies the `autoscaling_config` settings (delays, factors, and bounds) on top of your decision.
670+
671+
Your policy can return an `int` or a `float` "raw desired" replica count. Ray Serve returns an integer decision number.
672+
::::
673+
674+
645675
### Custom metrics
646676

647677
You can make richer decisions by emitting your own metrics from the deployment. Implement `record_autoscaling_stats()` to return a `dict[str, float]`. Ray Serve will surface these values in the [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst).
@@ -681,9 +711,10 @@ By default, each deployment in Ray Serve autoscales independently. When you have
681711

682712
An application-level autoscaling policy is a function that takes a `dict[DeploymentID, AutoscalingContext]` objects (one per deployment) and returns a tuple of `(decisions, policy_state)`. Each context contains metrics and bounds for one deployment, and the policy returns target replica counts for all deployments.
683713

684-
The `policy_state` returned from an application-level policy must be a `dict[DeploymentID, dict]`— a dictionary mapping each deployment ID to its own state dictionary. Serve stores this per-deployment state and on the next control-loop iteration, injects each deployment's state back into that deployment's `AutoscalingContext.policy_state`.
714+
The `policy_state` returned from an application-level policy must be a `Dict[DeploymentID, Dict]`— a dictionary mapping each deployment ID to its own state dictionary. Serve stores this per-deployment state and on the next control-loop iteration, injects each deployment's state back into that deployment's `AutoscalingContext.policy_state`.
715+
The per deployment number replicas returned from the policy can be an `int` or a `float`. If it returns a float, Ray Serve converts it to an integer replica count by rounding up to the next greatest integer.
685716

686-
Serve itself does not interpret the contents of `policy_state`. All the keys in each deployment's state dictionary are user-controlled.
717+
Serve itself does not interpret the contents of `policy_state`. All the keys in each deployment's state dictionary are user-controlled except for internal keys that are used when default parameters are applied to custom autoscaling policies.
687718
The following example shows a policy that scales deployments based on their relative load, ensuring that downstream deployments have enough capacity for upstream traffic:
688719

689720
`autoscaling_policy.py` file:
@@ -728,6 +759,19 @@ Programmatic configuration of application-level autoscaling policies through `se
728759
When you specify both a deployment-level policy and an application-level policy, the application-level policy takes precedence. Ray Serve logs a warning if you configure both.
729760
:::
730761

762+
763+
#### Applying standard autoscaling parameters to application-level policies
764+
Ray Serve automatically applies standard autoscaling parameters (delays, factors, and min/max bounds) to application-level policies on a per-deployment basis.
765+
These parameters include:
766+
- `upscale_delay_s`, `downscale_delay_s`, `downscale_to_zero_delay_s`
767+
- `upscaling_factor`, `downscaling_factor`
768+
- `min_replicas`, `max_replicas`
769+
770+
The YAML configuration file shows the default parameters applied to the application level policy.
771+
```{literalinclude} ../doc_code/application_level_autoscaling_with_defaults.yaml
772+
:language: yaml
773+
```
774+
Your application level policy can return per deployment desired replicas as `int` or `float` values. Ray Serve applies the autoscaling config parameters per deployment and returns integer decisions.
731775
:::{warning}
732776
### Gotchas and limitations
733777

doc/source/serve/api/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,7 @@ See the [model composition guide](serve-model-composition) for how to update cod
8686
serve.config.AutoscalingConfig
8787
serve.config.AutoscalingPolicy
8888
serve.config.AutoscalingContext
89+
serve.autoscaling_policy.replica_queue_length_autoscaling_policy
8990
serve.config.AggregationFunction
9091
serve.config.RequestRouterConfig
9192
```
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
applications:
2+
- name: MyApp
3+
import_path: application_level_autoscaling:app
4+
autoscaling_policy:
5+
policy_function: autoscaling_policy:coordinated_scaling_policy_with_defaults
6+
deployments:
7+
- name: Preprocessor
8+
autoscaling_config:
9+
min_replicas: 1
10+
max_replicas: 10
11+
target_ongoing_requests: 1
12+
upscale_delay_s: 2
13+
downscale_delay_s: 5
14+
- name: Model
15+
autoscaling_config:
16+
min_replicas: 2
17+
max_replicas: 20
18+
target_ongoing_requests: 1
19+
upscale_delay_s: 3
20+
downscale_delay_s: 5

doc/source/serve/doc_code/autoscaling_policy.py

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,3 +131,48 @@ def stateful_application_level_policy(
131131

132132

133133
# __end_stateful_application_level_policy__
134+
# __begin_apply_autoscaling_config_example__
135+
from typing import Any, Dict
136+
from ray.serve.config import AutoscalingContext
137+
138+
139+
def queue_length_based_autoscaling_policy(
140+
ctx: AutoscalingContext,
141+
) -> tuple[int, Dict[str, Any]]:
142+
# This policy calculates the "raw" desired replicas based on queue length.
143+
# Ray Serve automatically applies scaling factors, delays, and bounds from
144+
# the deployment's autoscaling_config on top of this decision.
145+
146+
queue_length = ctx.total_num_requests
147+
148+
if queue_length > 50:
149+
return 10, {}
150+
elif queue_length > 10:
151+
return 5, {}
152+
else:
153+
return 0, {}
154+
# __end_apply_autoscaling_config_example__
155+
156+
# __begin_apply_autoscaling_config_usage__
157+
from ray import serve
158+
from ray.serve.config import AutoscalingConfig, AutoscalingPolicy
159+
160+
@serve.deployment(
161+
autoscaling_config=AutoscalingConfig(
162+
min_replicas=1,
163+
max_replicas=10,
164+
metrics_interval_s=0.1,
165+
upscale_delay_s=1.0,
166+
downscale_delay_s=1.0,
167+
policy=AutoscalingPolicy(
168+
policy_function=queue_length_based_autoscaling_policy
169+
)
170+
),
171+
max_ongoing_requests=5,
172+
)
173+
class MyDeployment:
174+
def __call__(self) -> str:
175+
return "Hello, world!"
176+
177+
app = MyDeployment.bind()
178+
# __end_apply_autoscaling_config_usage__

python/ray/serve/_private/autoscaling_state.py

Lines changed: 24 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,8 @@
11
import logging
2+
import math
23
import time
34
from collections import defaultdict
4-
from typing import Any, Callable, Dict, List, Optional, Set, Tuple
5+
from typing import Any, Callable, Dict, List, Optional, Set, Tuple, Union
56

67
from ray.serve._private.common import (
78
RUNNING_REQUESTS_KEY,
@@ -28,6 +29,10 @@
2829
)
2930
from ray.serve._private.usage import ServeUsageTag
3031
from ray.serve._private.utils import get_capacity_adjusted_num_replicas
32+
from ray.serve.autoscaling_policy import (
33+
_apply_app_level_autoscaling_config,
34+
_apply_autoscaling_config,
35+
)
3136
from ray.serve.config import AutoscalingContext, AutoscalingPolicy
3237
from ray.util import metrics
3338

@@ -52,7 +57,9 @@ def __init__(self, deployment_id: DeploymentID):
5257
self._deployment_info = None
5358
self._config = None
5459
self._policy: Optional[
55-
Callable[[AutoscalingContext], Tuple[int, Optional[Dict[str, Any]]]]
60+
Callable[
61+
[AutoscalingContext], Tuple[Union[int, float], Optional[Dict[str, Any]]]
62+
]
5663
] = None
5764
# user defined policy returns a dictionary of state that is persisted between autoscaling decisions
5865
# content of the dictionary is determined by the user defined policy
@@ -113,7 +120,8 @@ def register(self, info: DeploymentInfo, curr_target_num_replicas: int) -> int:
113120

114121
self._deployment_info = info
115122
self._config = config
116-
self._policy = self._config.policy.get_policy()
123+
# Apply default autoscaling config to the policy
124+
self._policy = _apply_autoscaling_config(self._config.policy.get_policy())
117125
self._target_capacity = info.target_capacity
118126
self._target_capacity_direction = info.target_capacity_direction
119127
self._policy_state = {}
@@ -305,6 +313,9 @@ def get_decision_num_replicas(
305313
# Time the policy execution
306314
start_time = time.time()
307315
decision_num_replicas, self._policy_state = self._policy(autoscaling_context)
316+
# The policy can return a float value.
317+
if isinstance(decision_num_replicas, float):
318+
decision_num_replicas = math.ceil(decision_num_replicas)
308319
policy_execution_time_ms = (time.time() - start_time) * 1000
309320

310321
self.record_autoscaling_metrics(
@@ -815,7 +826,10 @@ def __init__(
815826
self._policy: Optional[
816827
Callable[
817828
[Dict[DeploymentID, AutoscalingContext]],
818-
Tuple[Dict[DeploymentID, int], Optional[Dict[DeploymentID, Dict]]],
829+
Tuple[
830+
Dict[DeploymentID, Union[int, float]],
831+
Optional[Dict[DeploymentID, Dict]],
832+
],
819833
]
820834
] = None
821835
# user defined policy returns a dictionary of state that is persisted between autoscaling decisions
@@ -837,7 +851,10 @@ def register(
837851
Args:
838852
autoscaling_policy: The autoscaling policy to register.
839853
"""
840-
self._policy = autoscaling_policy.get_policy()
854+
# Apply default autoscaling config to the policy
855+
self._policy = _apply_app_level_autoscaling_config(
856+
autoscaling_policy.get_policy()
857+
)
841858
self._policy_state = {}
842859

843860
# Log when custom autoscaling policy is used for application
@@ -974,10 +991,10 @@ def get_decision_num_replicas(
974991
)
975992
results[deployment_id] = (
976993
self._deployment_autoscaling_states[deployment_id].apply_bounds(
977-
num_replicas
994+
math.ceil(num_replicas)
978995
)
979996
if not _skip_bound_check
980-
else num_replicas
997+
else math.ceil(num_replicas)
981998
)
982999
return results
9831000
else:

0 commit comments

Comments
 (0)