[train] Add a placement group cleaner for Ray Train #58515
Merged
justinvyu merged 25 commits into ray-project:master on Dec 1, 2025
TimothySeah
reviewed
Nov 12, 2025
python/ray/train/v2/_internal/execution/controller/controller.py
Signed-off-by: Lehui Liu <lehui@anyscale.com>
python/ray/train/v2/_internal/execution/controller/placement_group_cleaner.py
python/ray/train/v2/_internal/callbacks/placement_group_callback.py
justinvyu
reviewed
Nov 19, 2025
TimothySeah
approved these changes
Nov 20, 2025
Nice, this looks great! It might also be worth testing this by running a bunch of train runs in a workspace and verifying that we don't end up with a bunch of detached actors as pointed out in https://github.com/ray-project/ray/pull/58515/files#r2543599519
justinvyu
reviewed
Nov 21, 2025
Can you also copy-paste a sample train run's output into the PR description? I want to make sure our new logs do not spam the driver logs.
justinvyu
reviewed
Nov 26, 2025
justinvyu
approved these changes
Nov 30, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Previously, the placement group lifetime was tied to the Ray job driver. This means that with Tune + Train V2, or with Train V2 async validation where the validation task creates its own placement group, any placement group owned by a process other than the main job driver sticks around for the rest of the main job driver's lifetime.
Why did Train v1 + Tune not run into this issue?
Tune’s driver process kept track of the placement groups spawned for children, including Train. So the Tune driver process was able to remove the placement group after stopping the trial.
If the Tune driver was launched in a remote task and was killed, you’d run into the same issue as long as the job driver was still alive.
To resolve this, we propose adding a placement group cleaner that runs as a detached actor alongside the Ray Train controller, wired in through ControllerCallback and WorkerGroupCallback. The cleaner monitors the liveness of the controller and, if the controller dies without exiting gracefully, cleans up the PGs that the controller spawned.
The flow now looks like this:
a. After the controller starts, the PG cleaner is registered with the controller ID.
b. After the worker group starts and the PG is created, the PG is registered with the PG cleaner.
c. The PG cleaner runs a monitor loop; if the controller is no longer alive, it tries to clean up the PG.
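For illustration, steps (a) through (c) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code from this PR: the class name `PlacementGroupCleanerSketch`, its methods, and the injected `is_controller_alive` / `remove_pg` callables are all hypothetical. In a real Ray setting the sketch would presumably run inside a detached actor, and `remove_pg` could plausibly be `ray.util.remove_placement_group`; how the controller's liveness is actually checked is left abstract here.

```python
import time
from typing import Callable, Hashable, Optional, Set


class PlacementGroupCleanerSketch:
    """Hypothetical stand-in for the PG cleaner; not the Ray Train implementation."""

    def __init__(
        self,
        is_controller_alive: Callable[[Hashable], bool],  # assumed liveness probe
        remove_pg: Callable[[Hashable], None],  # assumed PG removal function
        poll_interval_s: float = 1.0,
    ) -> None:
        self._is_controller_alive = is_controller_alive
        self._remove_pg = remove_pg
        self._poll_interval_s = poll_interval_s
        self._controller_id: Optional[Hashable] = None
        self._pgs: Set[Hashable] = set()

    def register_controller(self, controller_id: Hashable) -> None:
        # Step (a): called after the controller starts.
        self._controller_id = controller_id

    def register_placement_group(self, pg: Hashable) -> None:
        # Step (b): called after the worker group starts and its PG exists.
        self._pgs.add(pg)

    def check_once(self) -> bool:
        """Return True if the controller was found dead and cleanup was attempted."""
        if self._controller_id is None:
            return False
        if self._is_controller_alive(self._controller_id):
            return False
        for pg in list(self._pgs):
            try:
                self._remove_pg(pg)
            except Exception:
                pass  # Best effort: a failed removal does not stop the others.
            self._pgs.discard(pg)
        return True

    def run(self, max_iterations: Optional[int] = None) -> bool:
        # Step (c): poll until the controller is gone or the iteration cap is hit.
        iterations = 0
        while max_iterations is None or iterations < max_iterations:
            if self.check_once():
                return True
            time.sleep(self._poll_interval_s)
            iterations += 1
        return False


# Usage with in-memory fakes in place of Ray.
alive = {"controller-1": True}
removed = []
cleaner = PlacementGroupCleanerSketch(
    is_controller_alive=lambda cid: alive[cid],
    remove_pg=removed.append,
    poll_interval_s=0.0,
)
cleaner.register_controller("controller-1")
cleaner.register_placement_group("pg-A")
cleaner.check_once()  # Controller alive: nothing is removed.
alive["controller-1"] = False  # Simulate an ungraceful controller death.
cleaner.run(max_iterations=3)  # Cleaner notices and removes "pg-A".
```

Injecting the liveness probe and the removal function keeps the loop testable without a cluster; the graceful-shutdown path, where the controller removes its own PGs and the cleaner has nothing left to do, is omitted from this sketch.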
Related issues
#54305 #53921
Additional information
Repro: https://gist.github.com/liulehui/4b8ddb074f8db338cb5b331bcee0fd09, see logs in the gist comments
local test for pg cleaner shutdown: