[train] Make worker group start and poll async by TimothySeah · Pull Request #54181 · ray-project/ray

TimothySeah · 2025-06-27T19:10:05Z

Summary

Before this PR, worker group start and poll used blocking ray.get and ray.wait calls. As a result, if you aborted (#53600) a Ray Train run while it was starting a worker group (which can take many seconds) or polling one (which takes under a second), the abort could not begin until the start or poll had finished. Now that these operations use asyncio, an abort can take effect in the middle of them.
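The difference can be illustrated with plain asyncio, without Ray. This is a minimal sketch under stated assumptions: `time.sleep` stands in for a blocking ray.get, `asyncio.sleep` stands in for awaiting an object ref, and `run_with_abort`, `blocking_start`, and `async_start` are hypothetical names that are not part of Ray Train.

```python
import asyncio
import time


async def run_with_abort(start_worker_group, abort_after=0.05):
    """Start a (fake) worker group, request an abort shortly afterwards,
    and return how long it took before the abort could take effect."""
    t0 = time.monotonic()
    task = asyncio.create_task(start_worker_group())
    await asyncio.sleep(abort_after)  # the abort request arrives mid-start
    task.cancel()                     # no effect if the task already finished
    try:
        await task
    except asyncio.CancelledError:
        pass
    return time.monotonic() - t0


async def blocking_start():
    # Stand-in for a blocking ray.get: holds the event loop for 0.5 s,
    # so the abort above cannot even be issued until it returns.
    time.sleep(0.5)


async def async_start():
    # Stand-in for awaiting an object ref: yields to the event loop,
    # so the abort can cancel it partway through.
    await asyncio.sleep(0.5)


if __name__ == "__main__":
    print("blocking:", asyncio.run(run_with_abort(blocking_start)))
    print("async:   ", asyncio.run(run_with_abort(async_start)))
```

In the blocking variant the abort is delayed by roughly the full start time; in the async variant it lands shortly after it is requested.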

Implementation Notes

Note that the Ray Train controller still isn't fully async: for example, framework-specific on_start methods such as https://github.com/ray-project/ray/blob/master/python/ray/train/torch/config.py#L167 are still blocking. I can make the remaining operations async in a future PR.

Testing

NOT FULLY WORKING: got some exceptions: https://gist.github.com/TimothySeah/34e5ac5eec4619b413582d29ad804163

Closing this for now since the case below is difficult to fix. The issue is that an abort can arrive while we are awaiting pg.ready (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L259). That await happens:

* after before_worker_group_start (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L240), which creates the train run attempt and sets it to pending
* before the worker group is created (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/controller/controller.py#L293), whose existence the controller checks before aborting

Fixing this is more trouble than it's worth because we would need to clean up an in-progress placement group.
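To make the shape of this corner case concrete, here is a rough asyncio-only sketch. It is not Ray code and not a proposed fix: `FakePlacementGroup`, the `state` dict, and the "PENDING"/"ABORTED" strings are hypothetical stand-ins chosen for illustration. It only shows that a cancellation landing on the await leaves a pending run attempt and a half-created placement group, with no worker group object for an abort path to find.

```python
import asyncio


class FakePlacementGroup:
    """Hypothetical stand-in; this is not Ray's placement group API."""

    def __init__(self):
        self._ready = asyncio.Event()  # never set here: the group stays in progress
        self.removed = False

    async def ready(self):
        await self._ready.wait()

    def remove(self):
        self.removed = True


async def start_worker_group(state):
    state["run_attempt"] = "PENDING"   # the before-start callback has already run
    pg = state["pg"] = FakePlacementGroup()
    try:
        await pg.ready()               # an abort can land exactly here
    except asyncio.CancelledError:
        pg.remove()                    # the in-progress placement group needs cleanup
        state["run_attempt"] = "ABORTED"
        raise
    state["worker_group"] = object()   # only exists once the await has completed


async def demo():
    state = {}
    task = asyncio.create_task(start_worker_group(state))
    await asyncio.sleep(0)             # let the task reach pg.ready()
    task.cancel()                      # abort while the placement group is pending
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state


if __name__ == "__main__":
    final = asyncio.run(demo())
    print(final["run_attempt"], final["pg"].removed, "worker_group" in final)
```

An abort that only checks for `state["worker_group"]` would see nothing to tear down here, which is why the extra cleanup inside the `except` branch is needed in this sketch.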

Signed-off-by: Timothy Seah <tseah@anyscale.com>

…56757)

# Summary

Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293).

I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group.

Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion:

```
"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."
```

This PR

* introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases.
* Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
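The env var behavior described above can be sketched as follows. This is a minimal illustration, not the code from #56757: the variable name is taken from the description, while the helper names `blocking_get_log_level` and `train_default`, and the assumption that exactly the string "0" disables the warning, are hypothetical.

```python
import logging
import os

ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"  # name as given in the description


def blocking_get_log_level(environ=os.environ):
    """Hypothetical helper: WARNING by default, DEBUG when the var is "0"."""
    return logging.DEBUG if environ.get(ENV_VAR, "1") == "0" else logging.WARNING


def train_default(environ=os.environ):
    """Mirror the described Ray Train behavior: set "0" only if unset."""
    environ.setdefault(ENV_VAR, "0")


if __name__ == "__main__":
    env = {}                                   # a plain dict instead of os.environ
    print(blocking_get_log_level(env))         # default: warn
    train_default(env)
    print(blocking_get_log_level(env))         # after the Train default: debug
```

Using `setdefault` keeps any value the user has already exported, which matches the statement that users can still flip the env var themselves.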

[train] Make worker group start and polling async

9e29547

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah closed this Jun 27, 2025

TimothySeah mentioned this pull request Sep 23, 2025

[core][train] Ray Train disables blocking get inside async warning #56757

Merged

TimothySeah wants to merge 1 commit into ray-project:master from TimothySeah:tseah/controller-fully-async
