Remove Placement Group on Train Run Abort #56011

justinvyu merged 9 commits into ray-project:master from
Conversation
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Code Review
This pull request correctly addresses a resource leak by ensuring placement groups are removed when a train run is aborted. The change adds a call to _worker_group_state.shutdown() in the abort method, which is the right approach.
My review includes two main points for improvement:
- After aborting, the `WorkerGroup` object is left in an inconsistent state. I've suggested adding a call to `_clear_state()` to resolve this.
- A `TODO` comment has become outdated due to the change and should be updated for clarity and maintainability.
Overall, the change is in the right direction to fix the bug, and with these adjustments, it will be more robust.
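To make the reviewed pattern concrete, here is a minimal, self-contained sketch of an abort that releases the worker group's resources and then resets internal state. The class and method names (`WorkerGroup`, `_worker_group_state`, `shutdown`, `_clear_state`) follow the discussion above but are simplified stand-ins, not the actual Ray Train source; in the real code, `shutdown()` would remove the Ray placement group.

```python
class WorkerGroupState:
    """Stand-in for the worker group's resource state (assumed shape)."""

    def __init__(self):
        # In Ray Train this would hold a real placement group handle.
        self.placement_group = "pg-handle"

    def shutdown(self):
        # In Ray Train this would call ray.util.remove_placement_group(...).
        self.placement_group = None


class WorkerGroup:
    def __init__(self):
        self._worker_group_state = WorkerGroupState()

    def _clear_state(self):
        self._worker_group_state = None

    def abort(self):
        # The fix: release resources (including the placement group) on abort...
        self._worker_group_state.shutdown()
        # ...and, per the review suggestion, clear internal state so the
        # worker group is not left half-alive after the abort.
        self._clear_state()
```

Calling `abort()` twice, or using the group after an abort, would raise on the cleared state, which is the desired "fail loudly" behavior after teardown.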
python/ray/train/v2/_internal/execution/worker_group/worker_group.py
…oup.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
justinvyu left a comment
Thanks, can you add to the ABORTED test here: https://github.com/anyscale/rayturbo/blob/master/python/ray/train/v2/tests/test_worker_group.py#L512
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
# in TorchConfig.on_shutdown in case of a hang.
"TODO: add shutdown callback hooks"
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
Wait, I think "consider shutting down the workers in the future" is no longer applicable, because `worker_group_state.shutdown` does do that, right? Do we need to fix the `destroy_process_group` on an active group issue in this PR too?
Ah, thanks for the catch! I will remove the comment regarding the worker shutdown. As for `destroy_process_group` on an active group: that is not triggered unless we also perform the before_worker_group_abort callbacks, so it will not be included in this PR.
# TODO: Add a timeout in the case of a hang, particularly
# relevant when func is TorchConfig.on_shutdown
We can remove this; Tim added the timeout at a different layer.
This PR addresses a bug that occurs when users abort a Train Run from within a Python notebook. When a train run is aborted by stopping a cell execution, the associated placement groups are not removed. Because the train job persists while the notebook kernel is still running, the placement group is never cleaned up, which prevents the subsequent train run from progressing. This fix manually shuts down the worker group state, which includes the placement group, upon abort, allowing the user to immediately kick off another train run without having to restart the notebook.

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>