[core][autoscaler] Extend instance allocation timeout in autoscaler v2 #60392

Merged
edoakes merged 3 commits into ray-project:master from rueian:extend-autoscaler-default-allocation-timeout
Jan 22, 2026

@rueian (Contributor) commented Jan 21, 2026

Description

Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision.

Multiple users have encountered failures caused by this short timeout, for example:

https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559

https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM

https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM

https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM

This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types.
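
To illustrate the effect of the change, here is a minimal sketch of an elapsed-time check. It assumes only the 5-minute and 1-hour values stated above; the `PendingInstance` class, the `has_allocation_timed_out` helper, and both constant names are hypothetical and are not taken from the Ray codebase.

```python
from dataclasses import dataclass

# Values from the PR description; the constant names are illustrative only.
OLD_ALLOCATION_TIMEOUT_S = 5 * 60    # 5 minutes
NEW_ALLOCATION_TIMEOUT_S = 60 * 60   # 1 hour


@dataclass
class PendingInstance:
    """Hypothetical record of an instance that was requested but is not yet running."""
    instance_id: str
    requested_at_s: float  # time the allocation was requested, in seconds


def has_allocation_timed_out(instance: PendingInstance, now_s: float, timeout_s: float) -> bool:
    """Return True if the instance has been pending for longer than timeout_s."""
    return (now_s - instance.requested_at_s) > timeout_s


# A slow-provisioning instance that has been pending for 20 minutes.
slow_instance = PendingInstance(instance_id="tpu-0", requested_at_s=0.0)
now_s = 20 * 60

expired_under_old = has_allocation_timed_out(slow_instance, now_s, OLD_ALLOCATION_TIMEOUT_S)
expired_under_new = has_allocation_timed_out(slow_instance, now_s, NEW_ALLOCATION_TIMEOUT_S)
print(expired_under_old, expired_under_new)
```

In this sketch, an instance that needs 20 minutes to come up is treated as failed under the old value and is still waited on under the new one.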

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian requested a review from a team as a code owner January 21, 2026 23:40
@rueian added labels Jan 21, 2026: core-autoscaler (autoscaler related issues), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request increases the allocate_status_timeout_s in Autoscaler v2 from 5 minutes to 1 hour. This is a necessary change to support rare instance types like GPUs and TPUs that can have long provisioning times. The change is simple and correct. I've added one comment to improve the readability of the new timeout value and to suggest clarifying the associated comment, which appears to be slightly misleading about what this timeout covers. Overall, this is a good improvement for the robustness of the autoscaler.

rueian and others added 2 commits January 21, 2026 15:46
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
@edoakes edoakes merged commit 542fd29 into ray-project:master Jan 22, 2026
6 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026