[Data][Autoscaler][2/N] Add utilization-based cluster autoscaler for Ray Data #59362
bveeramani merged 8 commits into master
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Code Review
This pull request introduces an initial implementation of a new default cluster autoscaler for Ray Data. The new version (V2) is selectable via an environment variable and bases its scaling decisions on the average cluster utilization over a time window. The implementation is well-structured, includes a new TimeWindowAverageCalculator utility, and is accompanied by thorough unit tests. The code is clean and the logic appears sound. I have one minor suggestion to correct a constant used in a log message to avoid potential confusion during debugging.
```python
msg = (
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
```
The log message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS for the request expiration timeout. The actual timeout is set by AUTOSCALING_REQUEST_EXPIRE_TIME_S when request_resources is called. Using the correct constant here will ensure the log message is accurate and not misleading during debugging.
```diff
-    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
+    f" {self.AUTOSCALING_REQUEST_EXPIRE_TIME_S} seconds."
```
```python
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
)
```
Bug: Wrong timeout constant used in warning message
The warning message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS (10 seconds) when describing the request expiration timeout. The actual expiration time used when sending requests is AUTOSCALING_REQUEST_EXPIRE_TIME_S (180 seconds). This causes the warning to display incorrect information to users about when their request will expire.
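For illustration, a minimal sketch of the corrected message, using the constant names and values cited in this review (the values are taken from the review comments, not verified against the Ray source, and `requester_id` is a placeholder):

```python
# Hypothetical module-level constants mirroring the values cited in the review.
MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS = 10  # seconds between successive requests
AUTOSCALING_REQUEST_EXPIRE_TIME_S = 180    # lifetime of an outstanding request

requester_id = "dataset_1"  # placeholder requester ID
msg = (
    f"Failed to cancel resource request for {requester_id}."
    " The request will still expire after the timeout of"
    # Use the expiration constant, not the throttling gap, so the warning
    # reports the actual lifetime of the outstanding request.
    f" {AUTOSCALING_REQUEST_EXPIRE_TIME_S} seconds."
)
print(msg)
```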
```python
        ray.init()

    def teardown_class(self):
        ray.shutdown()
```
Bug: Test class uses wrong method names for setup/teardown
The test class inherits from unittest.TestCase but defines setup_class(self) and teardown_class(self) methods. For unittest.TestCase, the correct names are setUpClass and tearDownClass, which must be class methods decorated with @classmethod and receive cls instead of self. As written, these methods won't be invoked by the test runner, so ray.init() won't be called and the instance attributes like _node_type1 won't be set, causing tests to fail with AttributeError.
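A minimal sketch of the fix the comment describes, with `ray.init()`/`ray.shutdown()` replaced by a class flag so the snippet is self-contained:

```python
import unittest


class TestAutoscaler(unittest.TestCase):
    # unittest.TestCase only recognizes setUpClass/tearDownClass, and they
    # must be classmethods; methods named setup_class/teardown_class are
    # silently ignored by the unittest runner.
    @classmethod
    def setUpClass(cls):
        cls._cluster_started = True  # stands in for ray.init()

    @classmethod
    def tearDownClass(cls):
        cls._cluster_started = False  # stands in for ray.shutdown()

    def test_cluster_is_up(self):
        # Fails with AttributeError if setUpClass was never invoked.
        self.assertTrue(self._cluster_started)
```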
…Ray Data (ray-project#59362)

This PR introduces a new cluster autoscaler implementation (DefaultClusterAutoscalerV2) that takes a fundamentally different approach to scaling decisions compared to the existing V1 autoscaler.

**Motivation**

The current V1 cluster autoscaler requests resources based on `incremental_resource_usage()`, which represents task-level resource bundles. This approach has limitations:

1. Indirect scaling: Requesting task-sized resource bundles doesn't directly translate to node additions, making it harder to predict and control cluster scaling behavior.
2. No utilization awareness: V1 doesn't consider actual cluster utilization. It can request more resources even when the cluster is underutilized, or fail to scale when genuinely needed.
3. Lack of smoothing: Without time-windowed averaging, momentary spikes or drops in resource demand can trigger unnecessary scaling events.

The V2 autoscaler addresses these issues by:

- Monitoring actual utilization: Tracks average CPU and memory utilization over a configurable time window (default 10 seconds) to make informed scaling decisions.
- Requesting whole nodes: Instead of task-sized bundles, V2 requests resources matching actual node specs in the cluster. This gives clearer signals to Ray's autoscaler about desired node count.
- Threshold-based scaling: Only triggers scale-up when utilization exceeds a threshold (default 75%), preventing premature scaling while ensuring capacity when genuinely needed.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>