
[train] Add a placement group cleaner for Ray Train#58515

Merged
justinvyu merged 25 commits intoray-project:masterfrom
liulehui:pg-cleanup
Dec 1, 2025

Conversation


@liulehui liulehui commented Nov 10, 2025

Description

  1. Previously, a placement group's lifetime was tied to the Ray job driver. This means that with Tune + Train V2, or with Train V2 async validation where the validation task creates its own placement group, placement groups owned by anything other than the main job driver stick around for the rest of the main driver's lifetime.

  2. Why did Train V1 + Tune not run into this issue?
    Tune’s driver process kept track of the placement groups spawned for its children, including Train, so the Tune driver was able to remove each placement group after stopping the trial.
    If the Tune driver was launched in a remote task and was killed, you’d run into the same issue for as long as the job driver was still alive.

  3. To resolve this, we propose adding a placement group cleaner that runs as a detached actor alongside the Ray Train controller, hooked in via ControllerCallback and WorkerGroupCallback. The cleaner monitors the liveness of the controller; if the controller dies without exiting gracefully, the cleaner removes the placement groups that controller spawned.

  4. The flow now looks like this:
    a. after the controller starts, the PG cleaner is registered with the controller ID
    b. after the worker group starts and the PG is created, the PG cleaner is registered with the PG
    c. the PG cleaner runs a monitor loop; if the controller is no longer alive, it tries to clean up the PG
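The registration-and-monitor flow above can be sketched without Ray as a plain Python loop. Names like `PGCleaner`, `register_controller`, and `register_pg` are illustrative only, not the PR's actual API; the real cleaner runs as a detached Ray actor and polls the controller actor's liveness:

```python
# Ray-free sketch of the cleaner's monitor loop (hypothetical names).
import threading
import time

class PGCleaner:
    """Tracks one controller and the placement groups it spawned."""

    def __init__(self, is_alive, remove_pg, poll_interval_s=0.01):
        self._is_alive = is_alive      # callable: is the controller alive?
        self._remove_pg = remove_pg    # callable: remove one placement group
        self._poll_interval_s = poll_interval_s
        self._controller_id = None
        self._pgs = []
        self._stopped = threading.Event()

    # (a) after controller start: register the controller id
    def register_controller(self, controller_id):
        self._controller_id = controller_id

    # (b) after worker group start: register the created placement group
    def register_pg(self, pg_id):
        self._pgs.append(pg_id)

    # graceful shutdown path: controller exited cleanly, nothing to reap
    def stop(self):
        self._stopped.set()

    # (c) monitor loop: if the controller dies ungracefully, clean up its PGs
    def run(self):
        while not self._stopped.is_set():
            if self._controller_id is not None and not self._is_alive(self._controller_id):
                for pg_id in self._pgs:
                    self._remove_pg(pg_id)
                self._pgs.clear()
                return
            time.sleep(self._poll_interval_s)

alive = {"ctrl-1": True}
removed = []
cleaner = PGCleaner(lambda cid: alive[cid], removed.append)
cleaner.register_controller("ctrl-1")
cleaner.register_pg("pg-1")

t = threading.Thread(target=cleaner.run)
t.start()
alive["ctrl-1"] = False          # simulate an ungraceful controller death
t.join(timeout=2)
print(removed)                   # -> ['pg-1']
```

The graceful path is the `stop()` call: when the controller shuts down cleanly it already removes its own placement groups, so the cleaner only has to act on the ungraceful path.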

Related issues

#54305 #53921

Additional information

  1. Repro: https://gist.github.com/liulehui/4b8ddb074f8db338cb5b331bcee0fd09, see logs in the gist comments

  2. Local test for PG cleaner shutdown (screenshot in the original PR):

  3. Vanilla train run log after this change:
2025-11-23 22:20:35,483	INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(TrainController pid=94660) A run snapshot was found in storage folder at: '/Users/lehui/ray_results/train_with_failure'
(TrainController pid=94660) This snapshot contains a list of checkpoints reported via `ray.train.report` and will be loaded. This allows the latest checkpoint found in the snapshot to be accessible within your training function via `ray.train.get_checkpoint`.
(TrainController pid=94660) If you meant to start a brand new training job without any information about previous checkpoints found in this directory, please configure a new, unique `RunConfig(name)` or delete the existing folder at '/Users/lehui/ray_results/train_with_failure'.
(TrainController pid=94660) Attempting to start training worker group of size 2 with the following resources: [{'CPU': 1}] * 2
(TrainController pid=94660) Started training worker group of size 2: 
(TrainController pid=94660) - (ip=127.0.0.1, pid=94666) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=94660) - (ip=127.0.0.1, pid=94667) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=94666) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/lehui/ray_results/train_with_failure/checkpoint_2025-11-23_22-20-38.349831)
(RayTrainWorker pid=94666) Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/Users/lehui/ray_results/train_with_failure/checkpoint_2025-11-23_22-20-38.349831), metrics={'m': 0}, validation_spec=None)
(RayTrainWorker pid=94666) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/lehui/ray_results/train_with_failure/checkpoint_2025-11-23_22-20-38.353886)
(RayTrainWorker pid=94666) Reporting training result 2: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/Users/lehui/ray_results/train_with_failure/checkpoint_2025-11-23_22-20-38.353886), metrics={'m': 1}, validation_spec=None)

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui changed the title a pg reaper [train] Add a placement group cleaner for Ray Train Nov 17, 2025
@liulehui liulehui added the go add ONLY when ready to merge, run all tests label Nov 17, 2025
@liulehui liulehui marked this pull request as ready for review November 18, 2025 00:44
@liulehui liulehui requested a review from a team as a code owner November 18, 2025 00:44
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Nov 18, 2025

@justinvyu justinvyu left a comment


Looking good!


@TimothySeah TimothySeah left a comment


Nice, this looks great! It might also be worth testing this by running a bunch of train runs in a workspace and verifying that we don't end up with a bunch of detached actors as pointed out in https://github.com/ray-project/ray/pull/58515/files#r2543599519


@justinvyu justinvyu left a comment


Can you also copy-paste a sample train run's output into the PR description? I want to make sure our new logs do not spam the driver logs.


@justinvyu justinvyu left a comment


🚢

@justinvyu justinvyu merged commit 991bdd3 into ray-project:master Dec 1, 2025
6 checks passed
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
