[Data][Autoscaler][2/N] Add utilization-based cluster autoscaler for Ray Data #59362
bveeramani merged 8 commits into master
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Code Review
This pull request introduces an initial implementation of a new default cluster autoscaler for Ray Data. The new version (V2) is selectable via an environment variable and bases its scaling decisions on the average cluster utilization over a time window. The implementation is well-structured, includes a new TimeWindowAverageCalculator utility, and is accompanied by thorough unit tests. The code is clean and the logic appears sound. I have one minor suggestion to correct a constant used in a log message to avoid potential confusion during debugging.
```python
msg = (
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
```
The log message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS for the request expiration timeout. The actual timeout is set by AUTOSCALING_REQUEST_EXPIRE_TIME_S when request_resources is called. Using the correct constant here will ensure the log message is accurate and not misleading during debugging.
```diff
-    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
+    f" {self.AUTOSCALING_REQUEST_EXPIRE_TIME_S} seconds."
```
```python
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
)
```
Bug: Wrong timeout constant used in warning message
The warning message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS (10 seconds) when describing the request expiration timeout. The actual expiration time used when sending requests is AUTOSCALING_REQUEST_EXPIRE_TIME_S (180 seconds). This causes the warning to display incorrect information to users about when their request will expire.
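For illustration, a minimal sketch of the corrected message, using the constant names and values cited in this review (the values are taken from the review comments, not verified against the Ray source, and `requester_id` is a placeholder):

```python
# Hypothetical module-level constants mirroring the values cited in the review.
MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS = 10  # seconds between successive requests
AUTOSCALING_REQUEST_EXPIRE_TIME_S = 180    # lifetime of an outstanding request

requester_id = "dataset_1"  # placeholder requester ID
msg = (
    f"Failed to cancel resource request for {requester_id}."
    " The request will still expire after the timeout of"
    # Use the expiration constant, not the throttling gap, so the warning
    # reports the actual lifetime of the outstanding request.
    f" {AUTOSCALING_REQUEST_EXPIRE_TIME_S} seconds."
)
print(msg)
```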
```python
        ray.init()

    def teardown_class(self):
        ray.shutdown()
```
Bug: Test class uses wrong method names for setup/teardown
The test class inherits from unittest.TestCase but defines setup_class(self) and teardown_class(self) methods. For unittest.TestCase, the correct names are setUpClass and tearDownClass, which must be class methods decorated with @classmethod and receive cls instead of self. As written, these methods won't be invoked by the test runner, so ray.init() won't be called and the instance attributes like _node_type1 won't be set, causing tests to fail with AttributeError.
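A minimal sketch of the fix the comment describes, with `ray.init()`/`ray.shutdown()` replaced by a class flag so the snippet is self-contained:

```python
import unittest


class TestAutoscaler(unittest.TestCase):
    # unittest.TestCase only recognizes setUpClass/tearDownClass, and they
    # must be classmethods; methods named setup_class/teardown_class are
    # silently ignored by the unittest runner.
    @classmethod
    def setUpClass(cls):
        cls._cluster_started = True  # stands in for ray.init()

    @classmethod
    def tearDownClass(cls):
        cls._cluster_started = False  # stands in for ray.shutdown()

    def test_cluster_is_up(self):
        # Fails with AttributeError if setUpClass was never invoked.
        self.assertTrue(self._cluster_started)
```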
…Ray Data (ray-project#59362)

This PR introduces a new cluster autoscaler implementation (DefaultClusterAutoscalerV2) that takes a fundamentally different approach to scaling decisions compared to the existing V1 autoscaler.

**Motivation**

The current V1 cluster autoscaler requests resources based on `incremental_resource_usage()`, which represents task-level resource bundles. This approach has limitations:

1. Indirect scaling: Requesting task-sized resource bundles doesn't directly translate to node additions, making it harder to predict and control cluster scaling behavior.
2. No utilization awareness: V1 doesn't consider actual cluster utilization. It can request more resources even when the cluster is underutilized, or fail to scale when genuinely needed.
3. Lack of smoothing: Without time-windowed averaging, momentary spikes or drops in resource demand can trigger unnecessary scaling events.

The V2 autoscaler addresses these issues by:

- Monitoring actual utilization: Tracks average CPU and memory utilization over a configurable time window (default 10 seconds) to make informed scaling decisions.
- Requesting whole nodes: Instead of task-sized bundles, V2 requests resources matching actual node specs in the cluster. This gives clearer signals to Ray's autoscaler about desired node count.
- Threshold-based scaling: Only triggers scale-up when utilization exceeds a threshold (default 75%), preventing premature scaling while ensuring capacity when genuinely needed.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>