
[core][autoscaler] fix: Invalid status transition from QUEUED to RAY_STOP_REQUESTED in autoscaler v2#59550

Merged
edoakes merged 1 commit into ray-project:master from win5923:autoscaler-invalid-transition
Dec 22, 2025

Conversation

Member

@win5923 win5923 commented Dec 18, 2025

Description

When the autoscaler attempts to terminate QUEUED instances to enforce the max_num_nodes_per_type limit, the reconciler crashes with an assertion error. This happens because QUEUED instances are selected for termination, but the state machine doesn't allow transitioning them to a terminated state.

The reconciler assumes all non-ALLOCATED instances have Ray running and attempts to transition QUEUED → RAY_STOP_REQUESTED, which is invalid.

```python
for terminate_request in to_terminate:
    instance_id = terminate_request.instance_id
    if terminate_request.instance_status == IMInstance.ALLOCATED:
        # The instance is not yet running, so we can't request to stop/drain Ray.
        # Therefore, we can skip the RAY_STOP_REQUESTED state and directly
        # terminate the node.
        im_instance_to_terminate = im_instances_by_instance_id[instance_id]
        updates[terminate_request.instance_id] = IMInstanceUpdateEvent(
            instance_id=instance_id,
            new_instance_status=IMInstance.TERMINATING,
            cloud_instance_id=im_instance_to_terminate.cloud_instance_id,
            termination_request=terminate_request,
            details=f"terminating ray: {terminate_request.details}",
        )
    else:
        updates[terminate_request.instance_id] = IMInstanceUpdateEvent(
            instance_id=instance_id,
            new_instance_status=IMInstance.RAY_STOP_REQUESTED,
            termination_request=terminate_request,
            details=f"draining ray: {terminate_request.details}",
        )
```

This occurs when max_workers configuration is dynamically reduced or when instances exceed the limit.

```
2025-12-04 06:21:55,298	INFO event_logger.py:77 -- Removing 167 nodes of type elser-v2-ingest (max number of worker nodes per type reached).
2025-12-04 06:21:55,307 - INFO - Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307	INFO instance_manager.py:263 -- Update instance QUEUED->RAY_STOP_REQUESTED (id=e13a1528-ffd9-403b-9fd1-b54e9c2698a0, type=elser-v2-ingest, cloud_instance_id=, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=220
2025-12-04 06:21:55,307 - ERROR - Invalid status transition from QUEUED to RAY_STOP_REQUESTED
```

This PR adds a valid transition `QUEUED -> TERMINATED` to allow canceling queued instances.
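The allow-list style of state machine described here can be sketched as a small Python model. The names below mirror the instance statuses discussed in this PR, but the structure is purely illustrative and is not Ray's actual `InstanceUtil` implementation:

```python
# Illustrative allow-list state machine (a sketch, not Ray's actual code).
# Each status maps to the set of statuses it may transition into.
VALID_TRANSITIONS = {
    "QUEUED": {"REQUESTED", "TERMINATED"},  # QUEUED -> TERMINATED is the new edge
    "ALLOCATED": {"RAY_INSTALLING", "RAY_STOP_REQUESTED", "TERMINATING"},
    "RAY_STOP_REQUESTED": {"RAY_STOPPED", "TERMINATING"},
}

def set_status(current: str, new: str) -> bool:
    """Return True iff the allow-list permits the transition."""
    return new in VALID_TRANSITIONS.get(current, set())

# Before this fix, canceling a queued instance attempted
# QUEUED -> RAY_STOP_REQUESTED, which the allow-list rejects.
```

With the `QUEUED -> TERMINATED` edge present, canceling a queued instance validates, while the previously attempted transition still fails as expected.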

Related issues

Closes #59219


@win5923 win5923 requested a review from a team as a code owner December 18, 2025 16:26
@win5923 win5923 marked this pull request as draft December 18, 2025 16:26
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug where QUEUED instances were being transitioned to an invalid state (RAY_STOP_REQUESTED) upon a termination request. The introduction of the QUEUED -> TERMINATED state transition is a logical and clean solution for canceling instance allocations before any cloud resources are provisioned. The changes are well-implemented across the state machine definition, the reconciler logic, and the protobuf comments. The addition of a dedicated test case ensures the new behavior is verified and prevents future regressions. Overall, this is a high-quality contribution that improves the robustness of the autoscaler.

…STOP_REQUESTED in autoscaler v2

Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the autoscaler-invalid-transition branch from cb7fcfc to 44c7937 Compare December 18, 2025 16:34
@win5923 win5923 marked this pull request as ready for review December 18, 2025 16:43
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Dec 18, 2025
@rueian rueian added the go add ONLY when ready to merge, run all tests label Dec 21, 2025
Contributor

@rueian rueian left a comment


Thanks @win5923! Looks good to me!

Contributor

rueian commented Dec 22, 2025

@edoakes, please take a look and merge when you get a chance.

@edoakes edoakes merged commit 468d76d into ray-project:master Dec 22, 2025
7 checks passed
@win5923 win5923 deleted the autoscaler-invalid-transition branch December 23, 2025 00:42
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
edoakes pushed a commit that referenced this pull request Jan 23, 2026
…to RAY_STOP_REQUESTED (#60412)

## Description

When the autoscaler attempts to terminate instances in `RAY_INSTALLING`
state (e.g., due to `max_num_nodes_per_type` limits being reduced), it
crashes with an assertion error:

```
AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
```

<details>
<summary>Full stacktrace</summary>

```
2026-01-22 17:49:33,055    ERROR autoscaler.py:222 -- Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/autoscaler.py", line 206, in update_autoscaling_state
    return Reconciler.reconcile(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 126, in reconcile
    Reconciler._step_next(
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next
    Reconciler._scale_cluster(
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1223, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 635, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/virtualenv/lib/python3.12/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 264, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Invalid status transition from RAY_INSTALLING to RAY_STOP_REQUESTED
```

</details>

The reconciler incorrectly tries to transition `RAY_INSTALLING` to
`RAY_STOP_REQUESTED`, but the state machine doesn't allow this. Ray
isn't running yet, so there's nothing to stop/drain.

This fix adds `RAY_INSTALLING` to the same condition as `ALLOCATED`.
Both states have cloud instances allocated but Ray not yet running, so
they should transition directly to `TERMINATING`.
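A hedged sketch of the amended branch, following the description above (the function name is illustrative; this is not the exact reconciler code):

```python
# Sketch of the amended condition: ALLOCATED and RAY_INSTALLING both mean a
# cloud instance exists but Ray is not yet running, so there is nothing to
# drain and the instance can go straight to TERMINATING.
def termination_target_status(instance_status: str) -> str:
    if instance_status in ("ALLOCATED", "RAY_INSTALLING"):
        return "TERMINATING"        # skip the Ray stop/drain step
    return "RAY_STOP_REQUESTED"     # Ray is running; request a graceful stop
```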

## Related issues

Related to #59219 and #59550 (which fixed the same issue for `QUEUED`
instances)

## Additional information

The valid transitions from `RAY_INSTALLING` are: `RAY_RUNNING`,
`RAY_INSTALL_FAILED`, `RAY_STOPPED`, `TERMINATING`, `TERMINATED`.
`RAY_STOP_REQUESTED` is not in this set.
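The set-membership check described here mirrors the `assert InstanceUtil.set_status(...)` pattern in the stack trace and can be modeled directly (illustrative names; the real allow-list lives inside Ray):

```python
# Illustrative model of the check that raises the AssertionError above;
# the real allow-list is defined in Ray's InstanceUtil, not here.
VALID_FROM_RAY_INSTALLING = {
    "RAY_RUNNING", "RAY_INSTALL_FAILED", "RAY_STOPPED",
    "TERMINATING", "TERMINATED",
}

def is_valid_from_ray_installing(new_status: str) -> bool:
    # True only for transitions the state machine permits.
    return new_status in VALID_FROM_RAY_INSTALLING
```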

Signed-off-by: Johanna Reiml <johanna@reiml.dev>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Jan 26, 2026
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)


Development

Successfully merging this pull request may close these issues.

[AutoScaler V2] Invalid status transition from QUEUED to RAY_STOP_REQUESTED

3 participants