[train] Make worker group start and poll async #54181
Closed
TimothySeah wants to merge 1 commit into ray-project:master from
Signed-off-by: Timothy Seah <tseah@anyscale.com>
matthewdeng pushed a commit that referenced this pull request Sep 24, 2025

…56757)

# Summary

Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running.

However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group.

Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion:

```
"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."
```

This PR
* introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases.
* Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
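The toggle described in the commit message above can be sketched as follows. This is a minimal illustration, not Ray's actual implementation: only the variable name `WARN_BLOCKING_GET_INSIDE_ASYNC` and the "warn by default, `"0"` to downgrade" behavior come from the commit message. The helper names (`should_warn`, `report_blocking_get`, `apply_library_default`), the logger name, and the assumption that any value other than `"0"` keeps the warning are hypothetical.

```python
import logging
import os

# Name taken from the commit message; everything else here is hypothetical.
ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"

logger = logging.getLogger("blocking_get_demo")


def should_warn() -> bool:
    """Warn unless the variable is explicitly "0" (assumed parsing rule)."""
    return os.environ.get(ENV_VAR, "1") != "0"


def report_blocking_get() -> int:
    """Log the message at WARNING or DEBUG and return the level that was used."""
    level = logging.WARNING if should_warn() else logging.DEBUG
    logger.log(level, "Using blocking ray.get inside async actor.")
    return level


def apply_library_default() -> None:
    """What a library could do: opt out only if the user has not set a value."""
    os.environ.setdefault(ENV_VAR, "0")
```

Using `os.environ.setdefault` is one way to get the "set it to `"0"` if it is not already set" behavior: a value the user exported beforehand is left untouched, so they can still flip the warning back on.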
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…56757) Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…ay-project#56757) Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ay-project#56757) Signed-off-by: Timothy Seah <tseah@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#56757) Signed-off-by: Timothy Seah <tseah@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#56757) Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Summary

Before this PR, worker group `start` and `poll` were using blocking `ray.get` and `ray.wait` calls. This meant if you aborted (#53600) a ray train run that was starting (this can take many seconds) or polling (this takes under a second) a worker group, you would need to wait until the start or poll finished before the abort actually started. Now that we are using `asyncio`, we can abort in the middle of these operations.

Implementation Notes

Note that the ray train controller still isn't fully async e.g. framework-specific `on_start` methods like https://github.com/ray-project/ray/blob/master/python/ray/train/torch/config.py#L167 are still blocking. I can make the remaining operations async in a future PR.

Testing
NOT FULLY WORKING: got some exceptions: https://gist.github.com/TimothySeah/34e5ac5eec4619b413582d29ad804163
Closing this for now since it's difficult to fix the case below. Basically the issue is that if we abort while awaiting on pg.ready (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L259), that happens
This is more trouble than it's worth to fix because we would need to clean up an in-progress placement group.
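To make the corner case concrete, here is a small plain-`asyncio` sketch of aborting while a "placement group" is still being awaited. It does not use Ray: `FakePlacementGroup`, `start_worker_group`, and `abort_during_start` are hypothetical stand-ins, and a real placement group that is only partly reserved is likely much harder to clean up than flipping a flag.

```python
import asyncio


class FakePlacementGroup:
    """Hypothetical stand-in for a placement group; this is not Ray's API."""

    def __init__(self) -> None:
        self.removed = False

    async def ready(self) -> None:
        # Simulate a slow reservation that an abort may interrupt.
        await asyncio.sleep(10)

    def remove(self) -> None:
        self.removed = True


async def start_worker_group(pg: FakePlacementGroup) -> str:
    try:
        await pg.ready()
    except asyncio.CancelledError:
        # The abort arrived mid-wait: release the in-progress resource,
        # then re-raise so the caller still observes the cancellation.
        pg.remove()
        raise
    return "started"


async def abort_during_start() -> FakePlacementGroup:
    pg = FakePlacementGroup()
    task = asyncio.create_task(start_worker_group(pg))
    await asyncio.sleep(0.01)  # let the task reach `await pg.ready()`
    task.cancel()  # the "abort"
    try:
        await task
    except asyncio.CancelledError:
        pass
    return pg
```

Running `asyncio.run(abort_during_start())` returns a placement group whose `removed` flag is set, because the cancellation is delivered at the `await` and the handler runs before the task finishes. With a blocking wait in place of the `await`, the cancellation could not be delivered until the wait returned.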