Description
What happened + What you expected to happen
It seems that resources are not effectively released when a Train trial using the Train API V2 is stopped by a Tune stopping condition (the stop parameter of ray.tune.RunConfig).
The reproduction script below:
- initializes a local Ray cluster using 2 CPUs.
- generates 2 Tune trials
- uses a maximum trial concurrency of 1
- requests 1 CPU per trial
- reports 2 checkpoints from each trial before exiting the train function
- finishes the 1st trial and deadlocks on the 2nd trial while trying to allocate resources for a training worker group
At deadlock, this output keeps repeating:
(train_fn pid=43541) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
Trial status: 1 TERMINATED | 1 RUNNING
Current time: 2025-07-03 04:30:54. Total running time: 1min 30s
Logical resource usage: 1.0/2 CPUs, 0/0 GPUs
╭───────────────────────────────────────────────────────────────────────────╮
│ Trial name status p iter total time (s) m │
├───────────────────────────────────────────────────────────────────────────┤
│ train_fn_4c693_00001 RUNNING 2 │
│ train_fn_4c693_00000 TERMINATED 1 1 5.78526 0 │
╰───────────────────────────────────────────────────────────────────────────╯
What is surprising is that Tune reports only 1 out of 2 CPUs used, and yet it cannot start a worker group requiring only 1 more CPU.
This is what ray status reports at the point when the job is deadlocked:
Node status
---------------------------------------------------------------
Active:
1 node_70115b6bc63e16248c8fc84845a7c9e71dd1a792ed88b774ec81593f
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
1.0/2.0 CPU (1.0 used of 2.0 reserved in placement groups)
0B/114.77GiB memory
1.63KiB/9.31GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
{'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
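To confirm that the terminated trial's worker-group placement group is still reserved, the placement groups can be listed at the point of deadlock. This is a diagnostic sketch, not part of the repro; it assumes Ray's state API (ray.util.state) and a driver attached to the same cluster:

```python
# Diagnostic sketch: list placement groups while the job is deadlocked.
# Run from a separate script attached to the same Ray cluster.
import ray
from ray.util.state import list_placement_groups

ray.init(address="auto")  # attach to the existing cluster

for pg in list_placement_groups():
    # Expectation based on the `ray status` output above: a still-reserved
    # placement group from the terminated trial, plus a pending one for the
    # second trial's worker group.
    print(pg)
```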
However, the second trial does start and complete if I do one of the following:
- initialize the Ray cluster with 3 CPUs instead of just 2
- set max_iter to something > 2 in MaximumIterationStopper, thus allowing the trial to finish by itself
Likewise, if I request 2 CPUs per trial, then I need to initialize the cluster with 5 CPUs for both trials to start and complete. So, generally, it seems that 1 + num_trials * cpu_per_trial CPUs are needed, even though the trials are supposed to run sequentially, not in parallel.
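The observations above are consistent with a hypothetical accounting model (cpus_needed is my own illustrative name, not a Ray API): the running trial actor holds 1 CPU, and each trial's worker-group reservation of cpu_per_trial CPUs appears never to be released.

```python
# Illustrative model of the observed CPU requirement; not a Ray API.
def cpus_needed(num_trials, cpu_per_trial):
    # 1 CPU for the trial actor currently running, plus one seemingly
    # never-released worker-group reservation per trial.
    return 1 + num_trials * cpu_per_trial

print(cpus_needed(2, 1))  # -> 3, matching the 3-CPU workaround above
print(cpus_needed(2, 2))  # -> 5, matching the 5-CPU observation
```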
Moreover, if I remove the max_concurrent_trials=1 limit and submit more trials than the cluster resources can support at once, then no trials ever start, even though the cluster resources are sufficient for each individual trial.
Also, no such issue is observed using the Train V1 interface - passing the TorchTrainer object directly to the Tuner.
Versions / Dependencies
Ray: 2.47.1
Python: 3.12.9
Reproduction script
import ray.train
import ray.tune
import ray.tune.integration.ray_train
import ray.tune.stopper
import ray.train.torch

from tempfile import TemporaryDirectory


def train_worker_fn():
    with TemporaryDirectory() as tmpdir:
        c = ray.train.Checkpoint(tmpdir)
        for i in range(2):
            ray.train.report(metrics={'m': i}, checkpoint=c)


def train_fn(config):
    print(f"Running trial {ray.tune.get_context().get_trial_id()}")
    trainer = ray.train.torch.TorchTrainer(
        train_worker_fn,
        scaling_config=ray.train.ScalingConfig(
            num_workers=1
        ),
        run_config=ray.train.RunConfig(
            callbacks=[ray.tune.integration.ray_train.TuneReportCallback()],
        )
    )
    trainer.fit()


def main():
    tuner = ray.tune.Tuner(
        train_fn,
        param_space={'p': ray.tune.grid_search([1, 2])},
        tune_config=ray.tune.TuneConfig(max_concurrent_trials=1),
        run_config=ray.tune.RunConfig(
            stop=ray.tune.stopper.MaximumIterationStopper(
                max_iter=1,
            )
        )
    )
    tuner.fit()


ray.init(num_cpus=2)
main()
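For comparison, the Train V1-style pattern mentioned above, which does not exhibit the issue, looks roughly like this. This is a sketch rather than a verified minimal repro; the 'train_loop_config' nesting follows the convention for passing parameters to a Trainer through the Tuner.

```python
# Train V1-style usage: pass the TorchTrainer itself to the Tuner instead
# of wrapping it in a Tune trainable function.
import ray
import ray.train.torch
import ray.tune

ray.init(num_cpus=2)

trainer = ray.train.torch.TorchTrainer(
    train_worker_fn,  # same worker function as in the repro script above
    scaling_config=ray.train.ScalingConfig(num_workers=1),
)
tuner = ray.tune.Tuner(
    trainer,
    param_space={'train_loop_config': {'p': ray.tune.grid_search([1, 2])}},
    tune_config=ray.tune.TuneConfig(max_concurrent_trials=1),
)
tuner.fit()
```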
Issue Severity
High: It blocks me from using Ray Tune with Ray Train V2