[core][autoscaler] Extend instance allocation timeout in autoscaler v2 by rueian · Pull Request #60392 · ray-project/ray

rueian · 2026-01-21T23:40:50Z

Description

Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision.

Multiple users have encountered failures caused by this short timeout, for example:

https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559

https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM

https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM

https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM

This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
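
To make the effect of the change concrete, here is a minimal sketch of how an allocation timeout decides whether a pending instance is treated as failed. It is not the autoscaler's actual reconciliation code: `PendingInstance`, `allocation_timed_out`, and the 20-minute provisioning time are hypothetical names and numbers chosen for illustration. Only the 5-minute and 1-hour values come from this PR.

```python
from dataclasses import dataclass

OLD_ALLOCATE_TIMEOUT_S = 5 * 60   # previous default described in this PR
NEW_ALLOCATE_TIMEOUT_S = 60 * 60  # new default described in this PR


@dataclass
class PendingInstance:
    """Hypothetical record of an instance that was requested but is not running yet."""

    instance_type: str
    requested_at_s: float  # time at which the allocation was requested


def allocation_timed_out(instance: PendingInstance, now_s: float, timeout_s: float) -> bool:
    """Return True if the instance has been pending longer than timeout_s."""
    return (now_s - instance.requested_at_s) > timeout_s


# Assumption: a scarce accelerator instance that is still provisioning after 20 minutes.
tpu = PendingInstance(instance_type="tpu-slice", requested_at_s=0.0)
now_s = 20 * 60

gave_up_before = allocation_timed_out(tpu, now_s, OLD_ALLOCATE_TIMEOUT_S)
gave_up_after = allocation_timed_out(tpu, now_s, NEW_ALLOCATE_TIMEOUT_S)
print(gave_up_before, gave_up_after)
```

Under these assumptions the old limit abandons the instance while it is still provisioning, and the new limit keeps waiting for it.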

gemini-code-assist

Code Review

This pull request increases the allocate_status_timeout_s in Autoscaler v2 from 5 minutes to 1 hour. This is a necessary change to support rare instance types like GPUs and TPUs that can have long provisioning times. The change is simple and correct. I've added one comment to improve the readability of the new timeout value and to suggest clarifying the associated comment, which appears to be slightly misleading about what this timeout covers. Overall, this is a good improvement for the robustness of the autoscaler.

python/ray/autoscaler/v2/instance_manager/config.py
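
The inline review comment on this file is not reproduced on this page. As a hedged illustration of the readability point the review summary raises, the sketch below writes a one-hour timeout as a product rather than a bare number of seconds. Apart from the field name `allocate_status_timeout_s`, everything here is an assumption: the class `ExampleReconcileConfig`, the helper `_env_int`, and the environment variable `EXAMPLE_ALLOCATE_STATUS_TIMEOUT_S` are hypothetical and are not taken from Ray's `config.py`.

```python
import os
from dataclasses import dataclass, field

# Writing the value as a product makes the unit (seconds) and the size (1 hour) obvious.
ONE_HOUR_S = 60 * 60


def _env_int(name: str, default: int) -> int:
    """Hypothetical helper: read an integer override from the environment."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


@dataclass
class ExampleReconcileConfig:
    """Hypothetical config holder; not Ray's actual class."""

    # How long an instance may stay in the allocating state before it is
    # considered failed. The env var name is an assumption for this sketch.
    allocate_status_timeout_s: int = field(
        default_factory=lambda: _env_int("EXAMPLE_ALLOCATE_STATUS_TIMEOUT_S", ONE_HOUR_S)
    )


config = ExampleReconcileConfig()
print(config.allocate_status_timeout_s)
```

An environment override of this kind would let operators with unusually slow or fast provisioning adjust the limit without changing code.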

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com>

[core][autoscaler] Extend instance allocation timeout in autoscaler v2

0488d57

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

rueian requested a review from a team as a code owner January 21, 2026 23:40

rueian added the core-autoscaler (autoscaler related issues), core (Issues that should be addressed in Ray Core), and go (add ONLY when ready to merge, run all tests) labels Jan 21, 2026

gemini-code-assist bot reviewed Jan 21, 2026

python/ray/autoscaler/v2/instance_manager/config.py (outdated, resolved)

rueian and others added 2 commits January 21, 2026 15:46

Update python/ray/autoscaler/v2/instance_manager/config.py

e3a9211

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com>

Merge branch 'master' into extend-autoscaler-default-allocation-timeout

01c1849

edoakes approved these changes Jan 22, 2026

edoakes merged commit 542fd29 into ray-project:master Jan 22, 2026
6 checks passed

edoakes merged 3 commits into ray-project:master from rueian:extend-autoscaler-default-allocation-timeout

2 participants
