[train] Increase worker group start default timeout to 60s by liulehui · Pull Request #60376 · ray-project/ray

liulehui · 2026-01-21T18:06:44Z

Description

`ray.get(pg_handle.ready(), timeout=self._worker_group_start_timeout_s)` covers both starting the placement group and installing the runtime env. If the installation takes longer than 30s, the call times out and the controller cycles between scheduling and rescheduling:

2025-10-28 14:49:48.318	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
2025-10-28 14:50:18.369	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 1/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
2025-10-28 14:50:18.370	[State Transition] SCHEDULING -> RESCHEDULING.
2025-10-28 14:50:18.370	[State Transition] RESCHEDULING -> SCHEDULING.
2025-10-28 14:50:18.374	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
2025-10-28 14:50:48.379	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 2/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
2025-10-28 14:50:48.379	[State Transition] SCHEDULING -> RESCHEDULING.
2025-10-28 14:50:48.379	[State Transition] RESCHEDULING -> SCHEDULING.
2025-10-28 14:50:48.382	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
2025-10-28 14:51:18.387	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 3/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
2025-10-28 14:51:18.387	[State Transition] SCHEDULING -> RESCHEDULING.
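
The pattern in the log above can be reproduced with a small simulation. This is only a sketch, not Ray Train's controller code: `simulate_startup` and its parameters are hypothetical names, and it assumes that every retry starts from scratch, so each attempt needs the full startup time (placement group plus runtime env install) to fit inside the timeout.

```python
def simulate_startup(startup_s, timeout_s, max_attempts=3):
    """Simulate worker group startup attempts under a fixed timeout.

    Assumption (not confirmed by the PR): a timed-out attempt makes no
    progress that the next attempt can reuse.
    Returns (started, log), where log is a list of timeline strings.
    """
    log = []
    clock = 0.0
    for attempt in range(1, max_attempts + 1):
        log.append(f"t={clock:.0f}s attempt {attempt}: SCHEDULING")
        if startup_s <= timeout_s:
            # Startup fits inside the timeout, so the group comes up.
            clock += startup_s
            log.append(f"t={clock:.0f}s attempt {attempt}: started")
            return True, log
        # Startup is still running when the timeout fires: retry.
        clock += timeout_s
        log.append(f"t={clock:.0f}s attempt {attempt}: timed out -> RESCHEDULING")
    return False, log


if __name__ == "__main__":
    # A hypothetical 45s startup never fits in 30s but fits in 60s.
    for timeout in (30, 60):
        started, log = simulate_startup(startup_s=45, timeout_s=timeout)
        print(f"timeout={timeout}s started={started}")
        for line in log:
            print("  " + line)
```

Under these assumptions, a startup that needs 45s never completes with a 30s timeout, no matter how many retries are allowed, while a 60s timeout lets the first attempt finish.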

This change raises the default timeout from 30s to 60s to improve the fixedScalingPolicy experience when packages are installed via the runtime environment.
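
For illustration, resolving the timeout from the environment could look like the sketch below. The variable name `RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S` comes from the error message in the log and the 60s default from this PR; the helper `worker_group_start_timeout_s` is a hypothetical name and is not taken from Ray's source.

```python
import os

# Environment variable name as printed in the timeout error message.
ENV_VAR = "RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S"
# New default described in this PR (previously 30s).
DEFAULT_TIMEOUT_S = 60.0


def worker_group_start_timeout_s(environ=None):
    """Return the startup timeout in seconds (hypothetical helper).

    Falls back to the default when the variable is unset.
    """
    environ = os.environ if environ is None else environ
    return float(environ.get(ENV_VAR, DEFAULT_TIMEOUT_S))


if __name__ == "__main__":
    print(worker_group_start_timeout_s({}))
    print(worker_group_start_timeout_s({ENV_VAR: "120"}))
```

If 60s is still too short for a given runtime env, the error message itself suggests setting this environment variable to a larger value.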

Signed-off-by: Lehui Liu <lehui@anyscale.com>

gemini-code-assist

Code Review

This pull request increases the default timeout for worker group startup from 30 seconds to 60 seconds. This change is a practical adjustment to mitigate issues where the installation of runtime environments might exceed the previous timeout, causing premature rescheduling. The increased timeout should improve the reliability of worker group initialization, especially in environments with slower runtime environment setup.

justinvyu

Thanks!

…ct#60376) 1. `ray.get(pg_handle.ready(), timeout=self._worker_group_start_timeout_s)` includes both start placement group and install runtime env, if the installation takes longer than 30s, it will go into a scheduling/rescheduling phase 2. this change is to change the default timeout to 60s instead to mitigate the fixedScalingPolicy experience when packages are installed via runtime environment. Signed-off-by: Lehui Liu <lehui@anyscale.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>


increase wg start default timeout to 60s

61ee57e

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui requested a review from a team as a code owner January 21, 2026 18:06

gemini-code-assist bot reviewed Jan 21, 2026


ray-gardener bot added the train Ray Train Related Issue label Jan 21, 2026

Merge branch 'master' into mitigate-scheduling-rescheduling

317b343

liulehui added the go add ONLY when ready to merge, run all tests label Jan 21, 2026

justinvyu approved these changes Jan 22, 2026


justinvyu merged commit 63afcc6 into ray-project:master Jan 22, 2026
7 checks passed

justinvyu merged 2 commits into ray-project:master from liulehui:mitigate-scheduling-rescheduling
