
[Data][Autoscaler][2/N] Add utilization-based cluster autoscaler for Ray Data #59362

Merged
bveeramani merged 8 commits into master from new-default-autoscaler on Dec 11, 2025
Conversation

@bveeramani (Member) commented Dec 10, 2025

This PR introduces a new cluster autoscaler implementation (DefaultClusterAutoscalerV2) that takes a fundamentally different approach to scaling decisions compared to the existing V1 autoscaler.

Motivation

The current V1 cluster autoscaler requests resources based on incremental_resource_usage(), which represents task-level resource bundles. This approach has limitations:

  1. Indirect scaling: Requesting task-sized resource bundles doesn't directly translate to node additions, making it harder to predict and control cluster scaling behavior.
  2. No utilization awareness: V1 doesn't consider actual cluster utilization. It can request more resources even when the cluster is underutilized, or fail to scale when genuinely needed.
  3. Lack of smoothing: Without time-windowed averaging, momentary spikes or drops in resource demand can trigger unnecessary scaling events.

The V2 autoscaler addresses these issues by:

  • Monitoring actual utilization: Tracks average CPU and memory utilization over a configurable time window (default 10 seconds) to make informed scaling decisions.
  • Requesting whole nodes: Instead of task-sized bundles, V2 requests resources matching actual node specs in the cluster. This gives clearer signals to Ray's autoscaler about desired node count.
  • Threshold-based scaling: Only triggers scale-up when utilization exceeds a threshold (default 75%), preventing premature scaling while ensuring capacity when genuinely needed.
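The threshold rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual `DefaultClusterAutoscalerV2` code: the function and class names, the 75% constant as a module-level value, and the idea of returning a list of node-shaped bundles (to pass to Ray's autoscaler request API) are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List

UTILIZATION_THRESHOLD = 0.75  # default scale-up threshold (75%)


@dataclass
class NodeSpec:
    # Resources of one whole node in the cluster (hypothetical shape).
    resources: Dict[str, float]


def decide_scale_up(
    avg_cpu_util: float,
    avg_mem_util: float,
    node_spec: NodeSpec,
    extra_nodes: int = 1,
) -> List[Dict[str, float]]:
    """Return whole-node resource bundles to request, or [] for no scale-up.

    Requesting bundles shaped like real nodes gives Ray's autoscaler a
    clearer signal about desired node count than task-sized bundles would.
    """
    if max(avg_cpu_util, avg_mem_util) <= UTILIZATION_THRESHOLD:
        return []  # cluster is underutilized; don't request more capacity
    return [dict(node_spec.resources) for _ in range(extra_nodes)]
```

The key design point is that scale-up is gated on the smoothed utilization exceeding the threshold, and the request unit is a whole node, not a single task's bundle.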

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner December 10, 2025 22:25
@bveeramani bveeramani marked this pull request as draft December 10, 2025 22:25

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request introduces an initial implementation of a new default cluster autoscaler for Ray Data. The new version (V2) is selectable via an environment variable and bases its scaling decisions on the average cluster utilization over a time window. The implementation is well-structured, includes a new TimeWindowAverageCalculator utility, and is accompanied by thorough unit tests. The code is clean and the logic appears sound. I have one minor suggestion to correct a constant used in a log message to avoid potential confusion during debugging.
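The `TimeWindowAverageCalculator` utility mentioned above can be sketched as below. This is a guess at the shape of such a utility, not the PR's actual implementation: the method names, the `(timestamp, value)` deque representation, and the injectable `now` parameter are assumptions for the sketch.

```python
import time
from collections import deque


class TimeWindowAverageCalculator:
    """Average of samples recorded within the last `window_s` seconds.

    Averaging over a window smooths momentary spikes or drops in
    utilization so they don't trigger unnecessary scaling events.
    """

    def __init__(self, window_s: float = 10.0):
        self._window_s = window_s
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, value: float, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self._samples.append((now, value))
        self._evict(now)

    def average(self, now: float = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._samples:
            return 0.0
        return sum(v for _, v in self._samples) / len(self._samples)

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        while self._samples and now - self._samples[0][0] > self._window_s:
            self._samples.popleft()
```

With a 10-second window, a sample recorded 11 seconds ago no longer influences the average, so a single transient spike decays out of the scaling signal quickly.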

msg = (
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
)

Severity: medium

The log message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS for the request expiration timeout. The actual timeout is set by AUTOSCALING_REQUEST_EXPIRE_TIME_S when request_resources is called. Using the correct constant here will ensure the log message is accurate and not misleading during debugging.

Suggested change:
-    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
+    f" {self.AUTOSCALING_REQUEST_EXPIRE_TIME_S} seconds."

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani changed the title from "[Data] Add initial implementation of new default cluster autoscaler" to "[Data][2/N] Add initial implementation of new default cluster autoscaler" on Dec 10, 2025
@bveeramani changed the title from "[Data][2/N] Add initial implementation of new default cluster autoscaler" to "[Data][Autoscaler][2/N] Add initial implementation of new default cluster autoscaler" on Dec 10, 2025
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Dec 11, 2025
Base automatically changed from autoscaling-coordinator to master December 11, 2025 00:46
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani changed the title from "[Data][Autoscaler][2/N] Add initial implementation of new default cluster autoscaler" to "[Data][Autoscaler][2/N] Add utilization-based cluster autoscaler for Ray Data" on Dec 11, 2025
@bveeramani bveeramani marked this pull request as ready for review December 11, 2025 05:21
@bveeramani bveeramani merged commit 64f4168 into master Dec 11, 2025
5 checks passed
@bveeramani bveeramani deleted the new-default-autoscaler branch December 11, 2025 05:22
    f"Failed to cancel resource request for {self._requester_id}."
    " The request will still expire after the timeout of"
    f" {self.MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS} seconds."
)

Bug: Wrong timeout constant used in warning message

The warning message incorrectly uses MIN_GAP_BETWEEN_AUTOSCALING_REQUESTS (10 seconds) when describing the request expiration timeout. The actual expiration time used when sending requests is AUTOSCALING_REQUEST_EXPIRE_TIME_S (180 seconds). This causes the warning to display incorrect information to users about when their request will expire.


        ray.init()

    def teardown_class(self):
        ray.shutdown()

Bug: Test class uses wrong method names for setup/teardown

The test class inherits from unittest.TestCase but defines setup_class(self) and teardown_class(self) methods. For unittest.TestCase, the correct names are setUpClass and tearDownClass, which must be class methods decorated with @classmethod and receive cls instead of self. As written, these methods won't be invoked by the test runner, so ray.init() won't be called and the instance attributes like _node_type1 won't be set, causing tests to fail with AttributeError.


peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…Ray Data (ray-project#59362)
