[train] Make worker group start and poll async#54181

Closed
TimothySeah wants to merge 1 commit intoray-project:masterfrom
TimothySeah:tseah/controller-fully-async

TimothySeah (Contributor) commented Jun 27, 2025

Summary

Before this PR, worker group start and poll used blocking `ray.get` and `ray.wait` calls. This meant that if you aborted (#53600) a Ray Train run while it was starting a worker group (which can take many seconds) or polling one (which takes under a second), the abort did not begin until the start or poll finished. Now that these operations use asyncio, we can abort in the middle of them.
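To illustrate the difference, here is a minimal sketch that uses plain asyncio rather than Ray. `time.sleep` stands in for a blocking `ray.get`, `asyncio.sleep` stands in for awaiting an object ref, and all function names are hypothetical (none of them come from the Ray Train code).

```python
import asyncio
import time


async def blocking_start():
    # Stand-in for a blocking ray.get: the event loop cannot run anything
    # else (including an abort request) until this call returns.
    time.sleep(0.2)
    return "started"


async def async_start():
    # Stand-in for awaiting an object ref: control returns to the event
    # loop, so the task can be cancelled while it waits.
    await asyncio.sleep(0.2)
    return "started"


async def run_then_abort(start_fn, abort_after=0.05):
    """Start a task, request an abort shortly after, and report the outcome."""
    task = asyncio.create_task(start_fn())
    # With a blocking start, this sleep only returns after the start is done.
    await asyncio.sleep(abort_after)
    task.cancel()  # has no effect if the task already finished
    try:
        return await task
    except asyncio.CancelledError:
        return "aborted"


def demo():
    blocking = asyncio.run(run_then_abort(blocking_start))
    non_blocking = asyncio.run(run_then_abort(async_start))
    return blocking, non_blocking
```

In this sketch the blocking variant always runs to completion before the abort is even seen, while the awaiting variant is cancelled partway through.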

Implementation Notes

Note that the Ray Train controller still isn't fully async: for example, framework-specific on_start methods like https://github.com/ray-project/ray/blob/master/python/ray/train/torch/config.py#L167 are still blocking. I can make the remaining operations async in a future PR.

Testing

NOT FULLY WORKING: got some exceptions: https://gist.github.com/TimothySeah/34e5ac5eec4619b413582d29ad804163

Closing this for now since the case below is difficult to fix. The issue arises when we abort while awaiting on pg.ready (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L259); the screenshot below shows what happens.

Fixing this is more trouble than it's worth because we would need to clean up an in-progress placement group.

Screenshot 2025-06-27 at 1 21 41 PM
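For context, the general shape of the problem can be sketched with plain asyncio. This is not Ray's API and not a fix for the issue above: `FakePlacementGroup`, `start_worker_group`, and `abort_during_start` are hypothetical names, and the real difficulty (a placement group that is still being scheduled by the cluster) is not modeled here.

```python
import asyncio


class FakePlacementGroup:
    """Hypothetical stand-in for a placement group; not Ray's API."""

    def __init__(self):
        self.removed = False

    async def ready(self):
        # Pretend that scheduling takes a long time.
        await asyncio.sleep(10)

    def remove(self):
        self.removed = True


async def start_worker_group(pg):
    try:
        await pg.ready()
    except asyncio.CancelledError:
        # An abort arrived mid-wait: release the half-created resource,
        # then re-raise so the caller still observes the cancellation.
        pg.remove()
        raise
    return "started"


async def abort_during_start():
    pg = FakePlacementGroup()
    task = asyncio.create_task(start_worker_group(pg))
    await asyncio.sleep(0.01)  # let the task reach `await pg.ready()`
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pg.removed
```

The point of the sketch is only that every awaited step becomes a possible cancellation point, so each one needs its own cleanup path.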

Signed-off-by: Timothy Seah <tseah@anyscale.com>
matthewdeng pushed a commit that referenced this pull request Sep 24, 2025
…56757)

# Summary

Ray Train essentially has three parts: the driver, the controller actor,
and the worker actors. We turned the controller into an async actor so
that users can abort or get reported checkpoints from the controller
while it is running.

However, Ray Train currently calls `ray.get` several times within the
`Controller` async actor, e.g. [when waiting for the placement group to
be
ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293).
I tried replacing all of these calls with `await`s but ultimately
decided against it because doing so would be a large effort (see
#54181 for some examples,
including changing all our callbacks to be asyncio-compatible) and
would require us to handle complex corner cases, such as cleaning up an
in-progress placement group when the controller is aborted.

We decided that keeping the blocking `ray.get` calls was fine: the async
controller still enables the aforementioned operations (abort and get
reported checkpoints) and introduces no behavior regressions (the
`ray.get` calls were already blocking before we made the controller an
async actor). The one side effect is that Ray Train users now see the
warning below, which has caused confusion:

```
"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."
```

This PR:
* Introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles
whether this message is logged with `logger.warning` or `logger.debug`.
The default is to warn, so the change is a no-op for all non-Ray Train
use cases.
* Makes Ray Train set this env var to "0" if it is not already set. Users
can still override the env var if they want.
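A minimal sketch of the toggle described above, not the actual Ray implementation. Only the env var name and the warning text come from this PR; the helper names, the logger name, and the rule that any value other than "0" keeps the warning on are assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger("blocking_get_demo")  # hypothetical logger name

ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"
MESSAGE = (
    "Using blocking ray.get inside async actor. "
    "This blocks the event loop. Please use `await` "
    "on object ref with asyncio.gather if you want to "
    "yield execution to the event loop instead."
)


def blocking_get_log_level(environ=None):
    """Return logging.WARNING by default, logging.DEBUG when the var is "0"."""
    environ = os.environ if environ is None else environ
    # Assumption: any value other than "0" keeps the warning enabled.
    return logging.WARNING if environ.get(ENV_VAR, "1") != "0" else logging.DEBUG


def log_blocking_get(environ=None):
    """Emit the message at the level selected by the env var."""
    level = blocking_get_log_level(environ)
    logger.log(level, MESSAGE)
    return level


def apply_train_default(environ):
    """Mimic Ray Train's behavior: set "0" only if the user has not set it."""
    environ.setdefault(ENV_VAR, "0")
    return environ
```

Using `setdefault` is what keeps the user in control here: an explicit `WARN_BLOCKING_GET_INSIDE_ASYNC=1` set before Ray Train starts is left untouched.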

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>