Skip to content

[train] Increase worker group start default timeout to 60s#60376

Merged
justinvyu merged 2 commits intoray-project:masterfrom
liulehui:mitigate-scheduling-rescheduling
Jan 22, 2026
Merged

[train] Increase worker group start default timeout to 60s#60376
justinvyu merged 2 commits intoray-project:masterfrom
liulehui:mitigate-scheduling-rescheduling

Conversation

@liulehui
Copy link
Contributor

@liulehui liulehui commented Jan 21, 2026

Description

  1. ray.get(pg_handle.ready(), timeout=self._worker_group_start_timeout_s) includes both start placement group and install runtime env, if the installation takes longer than 30s, it will go into a scheduling/rescheduling phase
2025-10-28 14:49:48.318	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
I
2025-10-28 14:50:18.369	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 1/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
I
2025-10-28 14:50:18.370	[State Transition] SCHEDULING -> RESCHEDULING.
I
2025-10-28 14:50:18.370	[State Transition] RESCHEDULING -> SCHEDULING.
I
2025-10-28 14:50:18.374	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
I
2025-10-28 14:50:48.379	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 2/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
I
2025-10-28 14:50:48.379	[State Transition] SCHEDULING -> RESCHEDULING.
I
2025-10-28 14:50:48.379	[State Transition] RESCHEDULING -> SCHEDULING.
I
2025-10-28 14:50:48.382	Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1}] * 8
I
2025-10-28 14:51:18.387	[FailurePolicy] Decision: FailureDecision.RETRY, Error source: controller, Error count / maximum errors allowed: 3/inf, Error: Training failed due to controller error:
The worker group startup timed out after 30.0 seconds waiting for 8 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
I
2025-10-28 14:51:18.387	[State Transition] SCHEDULING -> RESCHEDULING.
I

  1. this change is to change the default timeout to 60s instead to mitigate the fixedScalingPolicy experience.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui requested a review from a team as a code owner January 21, 2026 18:06
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request increases the default timeout for worker group startup from 30 seconds to 60 seconds. This change is a practical adjustment to mitigate issues where the installation of runtime environments might exceed the previous timeout, causing premature rescheduling. The increased timeout should improve the reliability of worker group initialization, especially in environments with slower runtime environment setup.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Jan 21, 2026
@liulehui liulehui added the go add ONLY when ready to merge, run all tests label Jan 21, 2026
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@justinvyu justinvyu merged commit 63afcc6 into ray-project:master Jan 22, 2026
7 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…ct#60376)

1. `ray.get(pg_handle.ready(),
timeout=self._worker_group_start_timeout_s)` includes both start
placement group and install runtime env, if the installation takes
longer than 30s, it will go into a scheduling/rescheduling phase
2. this change is to change the default timeout to 60s instead to
mitigate the fixedScalingPolicy experience when packages are
installed via runtime environment.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
…ct#60376)

1. `ray.get(pg_handle.ready(),
timeout=self._worker_group_start_timeout_s)` includes both start
placement group and install runtime env, if the installation takes
longer than 30s, it will go into a scheduling/rescheduling phase
2. this change is to change the default timeout to 60s instead to
mitigate the fixedScalingPolicy experience when packages are
installed via runtime environment.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: 400Ping <jiekaichang@apache.org>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ct#60376)

1. `ray.get(pg_handle.ready(),
timeout=self._worker_group_start_timeout_s)` includes both start
placement group and install runtime env, if the installation takes
longer than 30s, it will go into a scheduling/rescheduling phase
2. this change is to change the default timeout to 60s instead to
mitigate the fixedScalingPolicy experience when packages are
installed via runtime environment.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ct#60376)

1. `ray.get(pg_handle.ready(),
timeout=self._worker_group_start_timeout_s)` includes both start
placement group and install runtime env, if the installation takes
longer than 30s, it will go into a scheduling/rescheduling phase
2. this change is to change the default timeout to 60s instead to
mitigate the fixedScalingPolicy experience when packages are
installed via runtime environment.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants