
[Ray Tune + Train V2] Broken resource sharing among Tune trials #54305

@jleben

Description

What happened + What you expected to happen

It seems that resources are not released when a Train trial using the Train API V2 is stopped by a Tune stopping condition (via the stop parameter of ray.tune.RunConfig).

The reproduction script below:

  • initializes a local Ray cluster using 2 CPUs.
  • generates 2 Tune trials
  • uses a maximum trial concurrency of 1
  • requests 1 CPU per trial
  • reports 2 checkpoints from each trial before exiting the train function
  • finishes the 1st trial and deadlocks on the 2nd trial while trying to allocate resources for a training worker group

At deadlock, this output keeps repeating:

(train_fn pid=43541) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
Trial status: 1 TERMINATED | 1 RUNNING
Current time: 2025-07-03 04:30:54. Total running time: 1min 30s
Logical resource usage: 1.0/2 CPUs, 0/0 GPUs
╭───────────────────────────────────────────────────────────────────────────╮
│ Trial name             status         p     iter     total time (s)     m │
├───────────────────────────────────────────────────────────────────────────┤
│ train_fn_4c693_00001   RUNNING        2                                   │
│ train_fn_4c693_00000   TERMINATED     1        1            5.78526     0 │
╰───────────────────────────────────────────────────────────────────────────╯

What is surprising is that Tune reports only 1 out of 2 CPUs used, and yet it cannot start a worker group requiring only 1 more CPU.

This is what ray status reports at the point when the job is deadlocked:

Node status
---------------------------------------------------------------
Active:
 1 node_70115b6bc63e16248c8fc84845a7c9e71dd1a792ed88b774ec81593f
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/2.0 CPU (1.0 used of 2.0 reserved in placement groups)
 0B/114.77GiB memory
 1.63KiB/9.31GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 {'CPU': 1.0} * 1 (PACK): 1+ pending placement groups

However, the second trial does start and complete if I do one of the following:

  • initialize the Ray cluster with 3 CPUs instead of just 2
  • set max_iter to something > 2 in MaximumIterationStopper, thus allowing the trial to finish by itself
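For context on the stopping condition: MaximumIterationStopper counts the results each trial reports and stops the trial once the count reaches max_iter. A rough pure-Python mimic of that logic (MaxIterStopper is a hypothetical illustration, not Ray's actual class):

```python
from collections import defaultdict

class MaxIterStopper:
    # Rough mimic of ray.tune.stopper.MaximumIterationStopper:
    # count the results reported per trial and stop a trial once
    # its count reaches max_iter.
    def __init__(self, max_iter):
        self._max_iter = max_iter
        self._iters = defaultdict(int)

    def __call__(self, trial_id, result):
        self._iters[trial_id] += 1
        return self._iters[trial_id] >= self._max_iter

stopper = MaxIterStopper(max_iter=1)
assert stopper("trial_0", {"m": 0}) is True  # stopped after the 1st report
```

With max_iter=1, each trial is stopped right after its first report, while its train function still has a second report pending; this early termination is what appears to leave the trial's placement group unreleased.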

Likewise, if I request 2 CPUs per trial, then I need to initialize the cluster with 5 CPUs for both trials to start and complete. So, in general, it seems that 1 + num_trials * cpus_per_trial CPUs are needed, even though the trials are supposed to run sequentially, not in parallel.
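The empirically observed requirement above can be summarized as a small formula (cpus_required is a hypothetical helper for illustration, not a Ray API):

```python
def cpus_required(num_trials, cpus_per_trial):
    # Empirically observed CPU count needed for all trials to complete
    # when trials are stopped early by the Tune stopping condition:
    # one extra CPU beyond the total worker CPUs of all trials.
    return 1 + num_trials * cpus_per_trial

# Matches the observations above:
assert cpus_required(2, 1) == 3  # 2 trials x 1 CPU each -> 3 CPUs needed
assert cpus_required(2, 2) == 5  # 2 trials x 2 CPUs each -> 5 CPUs needed
```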

Moreover, if I remove the max_concurrent_trials=1 limit and schedule more trials than the cluster resources can hold at once, then no trials ever start, even though the cluster resources can support each individual trial.

Also, no such issue is observed when using the Train V1 interface, i.e. passing the TorchTrainer object directly to the Tuner.

Versions / Dependencies

Ray: 2.47.1
Python: 3.12.9

Reproduction script

import ray.train
import ray.tune
import ray.tune.integration.ray_train
import ray.tune.stopper
import ray.train.torch
from tempfile import TemporaryDirectory


def train_worker_fn():
    with TemporaryDirectory() as tmpdir:
        c = ray.train.Checkpoint(tmpdir)
        # Report 2 checkpoints; the stopper below terminates the
        # trial after the 1st report.
        for i in range(2):
            ray.train.report(metrics={'m': i}, checkpoint=c)

def train_fn(config):
    print(f"Running trial {ray.tune.get_context().get_trial_id()}")

    trainer = ray.train.torch.TorchTrainer(
        train_worker_fn,
        scaling_config=ray.train.ScalingConfig(
            num_workers=1
        ),
        run_config=ray.train.RunConfig(
            callbacks=[ray.tune.integration.ray_train.TuneReportCallback()],
        )
    )
    trainer.fit()


def main():
    tuner = ray.tune.Tuner(
        train_fn,
        # grid search over 2 values -> 2 trials
        param_space={'p': ray.tune.grid_search([1, 2])},
        # run at most 1 trial at a time
        tune_config=ray.tune.TuneConfig(max_concurrent_trials=1),
        run_config=ray.tune.RunConfig(
            # stop each trial after its 1st reported result
            stop=ray.tune.stopper.MaximumIterationStopper(
                max_iter=1,
            )
        )
    )
    tuner.fit()

ray.init(num_cpus=2)  # local cluster with only 2 CPUs
main()

Issue Severity

High: It blocks me from using Ray Tune with Ray Train V2

Metadata


Labels

    bug, community-backlog, stability, train, triage, tune
