[core][autoscaler] fix: Invalid status transition from QUEUED to RAY_STOP_REQUESTED in autoscaler v2 (#59550)
Merged
edoakes merged 1 commit into ray-project:master on Dec 22, 2025
Conversation
Contributor
Code Review
This pull request correctly fixes a bug where QUEUED instances were being transitioned to an invalid state (RAY_STOP_REQUESTED) upon a termination request. The introduction of the QUEUED -> TERMINATED state transition is a logical and clean solution for canceling instance allocations before any cloud resources are provisioned. The changes are well-implemented across the state machine definition, the reconciler logic, and the protobuf comments. The addition of a dedicated test case ensures the new behavior is verified and prevents future regressions. Overall, this is a high-quality contribution that improves the robustness of the autoscaler.
…STOP_REQUESTED in autoscaler v2 Signed-off-by: win5923 <ken89@kimo.com>
Force-pushed from cb7fcfc to 44c7937
Contributor
@edoakes, please take a look and merge when you get a chance.
AYou0207
pushed a commit
to AYou0207/ray
that referenced
this pull request
Jan 13, 2026
…STOP_REQUESTED in autoscaler v2 (ray-project#59550) ## Description When the autoscaler attempts to terminate QUEUED instances to enforce the `max_num_nodes_per_type` limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state. The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid. https://github.com/ray-project/ray/blob/ba727da47a1a4af1f58c1642839deb0defd82d7a/python/ray/autoscaler/v2/instance_manager/reconciler.py#L1178-L1197 This occurs when the `max_workers` configuration is dynamically reduced or when instances exceed the limit. ``` 2025-12-04 06:21:55,298 INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached). 2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220 2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED ``` This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances. ## Related issues Closes ray-project#59219 ## Additional information Signed-off-by: win5923 <ken89@kimo.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
edoakes
pushed a commit
that referenced
this pull request
Jan 23, 2026
…to RAY_STOP_REQUESTED (#60412) ## Description When the autoscaler attempts to terminate instances in `RAY_INSTALLING` state (e.g., due to `max_num_nodes_per_type` limits being reduced), it crashes with an assertion error: ``` AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED ``` <details> <summary>Full stacktrace</summary> ``` 2026-01-22 17:49:33,055 ERROR autoscaler.py:222 -- Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED Traceback (most recent call last): File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/autoscaler.py", line 206, in update_autoscaling_state return Reconciler.reconcile( ^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 126, in reconcile Reconciler._step_next( File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next Reconciler._scale_cluster( File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1223, in _scale_cluster Reconciler._update_instance_manager(instance_manager, version, updates) File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 635, in _update_instance_manager reply = instance_manager.update_instance_manager_state( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state instance = self._update_instance( ^^^^^^^^^^^^^^^^^^^^^^ File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 264, in _update_instance assert InstanceUtil.set_status(instance, update.new_instance_status), ( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED ``` </details> The 
reconciler incorrectly tries to transition `RAY_INSTALLING` to `RAY_STOP_REQUESTED`, but the state machine doesn't allow this. Ray isn't running yet, so there's nothing to stop/drain. This fix adds `RAY_INSTALLING` to the same condition as `ALLOCATED`. Both states have cloud instances allocated but Ray not yet running, so they should transition directly to `TERMINATING`. ## Related issues Related to #59219 and #59550 (which fixed the same issue for `QUEUED` instances) ## Additional information The valid transitions from `RAY_INSTALLING` are: `RAY_RUNNING`, `RAY_INSTALL_FAILED`, `RAY_STOPPED`, `TERMINATING`, `TERMINATED`. `RAY_STOP_REQUESTED` is not in this set. Signed-off-by: Johanna Reiml <johanna@reiml.dev> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
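The branch described above can be sketched as follows. This is a hypothetical, simplified version of the reconciler's decision (the function name `termination_update` and the string statuses are illustrative, not Ray's actual API); it only shows the shape of the fix, which groups `RAY_INSTALLING` with `ALLOCATED`:

```python
# Hypothetical sketch of the reconciler branch that picks a termination
# update for an instance; names are illustrative, not Ray's actual API.
def termination_update(status: str) -> str:
    # A cloud instance exists but Ray is not running yet, so there is
    # nothing to stop/drain: go straight to TERMINATING. The fix adds
    # "RAY_INSTALLING" to this branch alongside "ALLOCATED".
    if status in ("ALLOCATED", "RAY_INSTALLING"):
        return "TERMINATING"
    # Ray is assumed to be running: request a graceful stop/drain first.
    return "RAY_STOP_REQUESTED"
```

Before the fix, `RAY_INSTALLING` fell through to the second branch, producing the invalid `RAY_INSTALLING -> RAY_STOP_REQUESTED` transition that tripped the assertion.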
xinyuangui2
pushed a commit
to xinyuangui2/ray
that referenced
this pull request
Jan 26, 2026
…to RAY_STOP_REQUESTED (ray-project#60412)
jinbum-kim
pushed a commit
to jinbum-kim/ray
that referenced
this pull request
Jan 29, 2026
…to RAY_STOP_REQUESTED (ray-project#60412)
400Ping
pushed a commit
to 400Ping/ray
that referenced
this pull request
Feb 1, 2026
…to RAY_STOP_REQUESTED (ray-project#60412)
lee1258561
pushed a commit
to pinterest/ray
that referenced
this pull request
Feb 3, 2026
…STOP_REQUESTED in autoscaler v2 (ray-project#59550)
peterxcli
pushed a commit
to peterxcli/ray
that referenced
this pull request
Feb 25, 2026
…STOP_REQUESTED in autoscaler v2 (ray-project#59550)
peterxcli
pushed a commit
to peterxcli/ray
that referenced
this pull request
Feb 25, 2026
…to RAY_STOP_REQUESTED (ray-project#60412)
peterxcli
pushed a commit
to peterxcli/ray
that referenced
this pull request
Feb 25, 2026
…to RAY_STOP_REQUESTED (ray-project#60412)
Description
When the autoscaler attempts to terminate QUEUED instances to enforce the `max_num_nodes_per_type` limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state. The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid.
ray/python/ray/autoscaler/v2/instance_manager/reconciler.py
Lines 1178 to 1197 in ba727da
This occurs when the `max_workers` configuration is dynamically reduced or when instances exceed the limit.
This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances.
Related issues
Closes #59219
Additional information
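The fix amounts to one new edge in the instance state machine. A minimal, hypothetical sketch under assumed names (the real transition table lives in `InstanceUtil` in the Ray source; `VALID_TRANSITIONS` and `set_status` here are illustrative, and only a small subset of states is shown):

```python
from enum import Enum, auto

class InstanceStatus(Enum):
    """Simplified subset of the autoscaler v2 instance states."""
    QUEUED = auto()
    REQUESTED = auto()
    RAY_STOP_REQUESTED = auto()
    TERMINATED = auto()

# Hypothetical transition table. The fix adds TERMINATED to the
# successors of QUEUED, so a queued allocation can be canceled before
# any cloud resource is provisioned.
VALID_TRANSITIONS = {
    InstanceStatus.QUEUED: {
        InstanceStatus.REQUESTED,
        InstanceStatus.TERMINATED,  # new edge added by this PR
    },
}

def set_status(current: InstanceStatus, new: InstanceStatus) -> bool:
    # Mirrors the assertion that previously crashed the reconciler:
    # a transition is allowed only if it is in the table.
    return new in VALID_TRANSITIONS.get(current, set())
```

With this edge in place, canceling a QUEUED instance is a legal transition, while QUEUED → RAY_STOP_REQUESTED remains invalid, matching the error in the logs above.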