Description
What happened + What you expected to happen
It seems that resources are not effectively released when a Train trial using the Train API V2 is stopped by a Tune stopping condition (the stop parameter of ray.tune.RunConfig).
The reproduction script below:
- initializes a local Ray cluster using 2 CPUs.
- generates 2 Tune trials
- uses a maximum trial concurrency of 1
- requests 1 CPU per trial
- reports 2 checkpoints from each trial before exiting the train function
- finishes the 1st trial and deadlocks on the 2nd trial while trying to allocate resources for a training worker group
At deadlock, this output keeps repeating:
(train_fn pid=43541) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
Trial status: 1 TERMINATED | 1 RUNNING
Current time: 2025-07-03 04:30:54. Total running time: 1min 30s
Logical resource usage: 1.0/2 CPUs, 0/0 GPUs
╭───────────────────────────────────────────────────────────────────────────╮
│ Trial name status p iter total time (s) m │
├───────────────────────────────────────────────────────────────────────────┤
│ train_fn_4c693_00001 RUNNING 2 │
│ train_fn_4c693_00000 TERMINATED 1 1 5.78526 0 │
╰───────────────────────────────────────────────────────────────────────────╯
What is surprising is that Tune reports only 1 out of 2 CPUs used, and yet it cannot start a worker group requiring only 1 more CPU.
This is what ray status reports at the point when the job is deadlocked:
Node status
---------------------------------------------------------------
Active:
1 node_70115b6bc63e16248c8fc84845a7c9e71dd1a792ed88b774ec81593f
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
1.0/2.0 CPU (1.0 used of 2.0 reserved in placement groups)
0B/114.77GiB memory
1.63KiB/9.31GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
{'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
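To confirm that the terminated trial's worker-group placement group is still reserved, the placement groups can be listed at the point of deadlock. This is a diagnostic sketch, not part of the repro; it assumes Ray's state API (ray.util.state) and a driver attached to the same cluster:

```python
# Diagnostic sketch: list placement groups while the job is deadlocked.
# Run from a separate script attached to the same Ray cluster.
import ray
from ray.util.state import list_placement_groups

ray.init(address="auto")  # attach to the existing cluster

for pg in list_placement_groups():
    # Expectation based on the `ray status` output above: a still-reserved
    # placement group from the terminated trial, plus a pending one for the
    # second trial's worker group.
    print(pg)
```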
However, the second trial does start and complete if I do one of the following:
- initialize the Ray cluster with 3 CPUs instead of just 2
- set max_iter to something > 2 in MaximumIterationStopper, thus allowing the trial to finish by itself
Likewise, if I request 2 CPUs per trial, then I need to initialize the cluster with 5 CPUs for both trials to start and complete. So, generally, it seems that 1 + num_trials * cpu_per_trial CPUs are needed, even though the trials are supposed to run sequentially, not in parallel.
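The observations above are consistent with a hypothetical accounting model (cpus_needed is my own illustrative name, not a Ray API): the running trial actor holds 1 CPU, and each trial's worker-group reservation of cpu_per_trial CPUs appears never to be released.

```python
# Illustrative model of the observed CPU requirement; not a Ray API.
def cpus_needed(num_trials, cpu_per_trial):
    # 1 CPU for the trial actor currently running, plus one seemingly
    # never-released worker-group reservation per trial.
    return 1 + num_trials * cpu_per_trial

print(cpus_needed(2, 1))  # -> 3, matching the 3-CPU workaround above
print(cpus_needed(2, 2))  # -> 5, matching the 5-CPU observation
```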
Moreover, if I remove the max_concurrent_trials=1 limit and submit more trials than the cluster resources can support at once, then no trials ever start, even though the cluster resources are sufficient for each individual trial.
Also, no such issue is observed using the Train V1 interface - passing the TorchTrainer object directly to the Tuner.
Versions / Dependencies
Ray: 2.47.1
Python: 3.12.9
Reproduction script
import ray.train
import ray.tune
import ray.tune.integration.ray_train
import ray.tune.stopper
import ray.train.torch

from tempfile import TemporaryDirectory


def train_worker_fn():
    with TemporaryDirectory() as tmpdir:
        c = ray.train.Checkpoint(tmpdir)
        for i in range(2):
            ray.train.report(metrics={'m': i}, checkpoint=c)


def train_fn(config):
    print(f"Running trial {ray.tune.get_context().get_trial_id()}")
    trainer = ray.train.torch.TorchTrainer(
        train_worker_fn,
        scaling_config=ray.train.ScalingConfig(
            num_workers=1
        ),
        run_config=ray.train.RunConfig(
            callbacks=[ray.tune.integration.ray_train.TuneReportCallback()],
        )
    )
    trainer.fit()


def main():
    tuner = ray.tune.Tuner(
        train_fn,
        param_space={'p': ray.tune.grid_search([1, 2])},
        tune_config=ray.tune.TuneConfig(max_concurrent_trials=1),
        run_config=ray.tune.RunConfig(
            stop=ray.tune.stopper.MaximumIterationStopper(
                max_iter=1,
            )
        )
    )
    tuner.fit()


ray.init(num_cpus=2)
main()
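For comparison, the Train V1-style pattern mentioned above, which does not exhibit the issue, looks roughly like this. This is a sketch rather than a verified minimal repro; the 'train_loop_config' nesting follows the convention for passing parameters to a Trainer through the Tuner.

```python
# Train V1-style usage: pass the TorchTrainer itself to the Tuner instead
# of wrapping it in a Tune trainable function.
import ray
import ray.train.torch
import ray.tune

ray.init(num_cpus=2)

trainer = ray.train.torch.TorchTrainer(
    train_worker_fn,  # same worker function as in the repro script above
    scaling_config=ray.train.ScalingConfig(num_workers=1),
)
tuner = ray.tune.Tuner(
    trainer,
    param_space={'train_loop_config': {'p': ray.tune.grid_search([1, 2])}},
    tune_config=ray.tune.TuneConfig(max_concurrent_trials=1),
)
tuner.fit()
```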
Issue Severity
High: It blocks me from using Ray Tune with Ray Train V2