[core][autoscaler] Extend instance allocation timeout in autoscaler v2 #60392
Merged
edoakes merged 3 commits into ray-project:master on Jan 22, 2026
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Code Review
This pull request increases the allocate_status_timeout_s in Autoscaler v2 from 5 minutes to 1 hour. This is a necessary change to support rare instance types like GPUs and TPUs that can have long provisioning times. The change is simple and correct. I've added one comment to improve the readability of the new timeout value and to suggest clarifying the associated comment, which appears to be slightly misleading about what this timeout covers. Overall, this is a good improvement for the robustness of the autoscaler.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com>
edoakes approved these changes on Jan 22, 2026
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request on Jan 29, 2026
ray-project#60392) ## Description Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision. Multiple users have encountered failures caused by this short timeout, for example: https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559 https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request on Feb 1, 2026
ray-project#60392) ## Description Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision. Multiple users have encountered failures caused by this short timeout, for example: https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559 https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 400Ping <jiekaichang@apache.org>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
ray-project#60392) ## Description Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision. Multiple users have encountered failures caused by this short timeout, for example: https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559 https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Autoscaler v2 uses a 5-minute timeout for a VM instance to be up and running. This timeout is often insufficient for rare instance types, such as GPU and TPU instances, which can take much longer to provision.
Multiple users have encountered failures caused by this short timeout, for example:
https://ray.slack.com/archives/C02GFQ82JPM/p1763416720349559
https://ray.slack.com/archives/C02GFQ82JPM/p1767966803946519?thread_ts=1767348863.118229&cid=C02GFQ82JPM
https://ray.slack.com/archives/C02GFQ82JPM/p1763533559614779?thread_ts=1763488820.585659&cid=C02GFQ82JPM
https://ray.slack.com/archives/C02GFQ82JPM/p1765205118511629?thread_ts=1764860436.694129&cid=C02GFQ82JPM
This PR increases the timeout from 5 minutes to 1 hour, making Autoscaler v2 more robust for slow-provisioning instance types.
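To make the effect of the change concrete, here is a minimal sketch of how a longer allocation timeout changes the outcome for a slow-provisioning instance. Only the setting name `allocate_status_timeout_s` comes from this PR's discussion; the `ReconcileTimeouts` class, the `allocation_timed_out` helper, and the two constants are hypothetical names used for illustration and are not the autoscaler's actual code.

```python
from dataclasses import dataclass

# Hypothetical constants for illustration. Writing the value as a product
# keeps the unit (seconds) and the intent (minutes/hours) readable.
OLD_ALLOCATE_TIMEOUT_S = 5 * 60   # 5 minutes, the value before this PR
NEW_ALLOCATE_TIMEOUT_S = 60 * 60  # 1 hour, the value after this PR


@dataclass(frozen=True)
class ReconcileTimeouts:
    """Illustrative stand-in for the autoscaler's timeout settings (assumed shape)."""

    allocate_status_timeout_s: int = NEW_ALLOCATE_TIMEOUT_S


def allocation_timed_out(requested_at_s: float, now_s: float, timeout_s: int) -> bool:
    """Return True if the instance has been waiting longer than timeout_s."""
    return (now_s - requested_at_s) > timeout_s


# Example: a GPU instance that needs 20 minutes to be provisioned.
waited_s = 20 * 60
print(allocation_timed_out(0.0, waited_s, OLD_ALLOCATE_TIMEOUT_S))  # True
print(allocation_timed_out(0.0, waited_s, ReconcileTimeouts().allocate_status_timeout_s))  # False
```

Under these assumptions, the 20-minute wait exceeds the old 5-minute limit and would be treated as a failure, while it stays within the new 1-hour limit.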