Skip to content

[Core] Update the attempt number of the actor creation task when actor restart#58877

Merged
dayshah merged 6 commits intoray-project:masterfrom
MengjinYan:core-2546
Jan 22, 2026
Merged

[Core] Update the attempt number of the actor creation task when actor restart#58877
dayshah merged 6 commits intoray-project:masterfrom
MengjinYan:core-2546

Conversation

@MengjinYan
Copy link
Contributor

@MengjinYan MengjinYan commented Nov 21, 2025

Description

Recently, we realized that during actor restarts, we will generate duplicate task events for the same actor creation task id and attempt 0. This is because, the actor restart process happens inside GCS and the corresponding actor creation task TaskSpec won't be updated with the new attempt number.

This PR adds the logic to update the attempt number to the actor creation task TaskSpec and add a test to verify the change.

Related issues

N/A

Additional information

N/A

@MengjinYan
Copy link
Contributor Author

Another issue we found during the investigation is that, because part of the actor creation process is done on GCS, the task events won't be sent for some of the task state transitions (SUBMITTED_TO_WORKER and PENDING_NODE_ASSIGNMENT for restart). I'm planning to create a followup PR to address the issue to expose the missing task events for the actor creation task.

@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Nov 21, 2025
@MengjinYan MengjinYan marked this pull request as ready for review November 21, 2025 21:14
@MengjinYan MengjinYan requested a review from a team as a code owner November 21, 2025 21:14
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Nov 22, 2025
assert event["task_definition_event"]["task_attempt"] == 0
assert (
event["task_definition_event"]["language"] == "PYTHON"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to just make the expected event and do event["task_definition_event"] = expected_event to make it a little cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. Because event fields like event_id is not deterministic.

I think one solution I can try is to extract all the deterministic fields and compare them with generated expected_event with those fields. Do you think it will be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I think something like that could be more readable, but not blocking this for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test also seems pretty complicated for testing a simple change, is there another export task event test for actors and can testing this logic be a part of that test or extension to that test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added test is to add one more scenario for actor/task execution so a whole new test so I think it is inevitable. At the same time, I can have followup PR to make the tests more readable.


# Add a second node to the cluster for the actor to be restarted on
cluster.add_node(num_cpus=2)
cluster.wait_for_nodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_node has wait: bool = True anyways


driver_task_definition_received = False
actor_creation_task_definition_received = False
for event in events:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the order of events deterministic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really as well. The order of the events from the same worker process should be deterministic but not across processes. And events from different processes can be interleaved with each other.

Comment on lines +1706 to +1721
script = """
import ray
import time
ray.init()

@ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
class Actor:
def __init__(self):
pass

def actor_task(self):
pass

actor = Actor.remote()
time.sleep(999) # Keep the actor alive
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to use textwrap.dedent to make this readable.

script, httpserver, ray_start_cluster_head_with_env_vars, validate_events
)

@_cluster_with_aggregator_target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at the fixture while reviewing this and I think the env_var_prefix makes it hard to find RAY_DASHBOARD_AGGREGATOR_AGENT_PRESERVE_PROTO_FIELD_NAME config. This is one of the few cases in which duplication is probably better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Discussed offline. We should fix that in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MengjinYan
Copy link
Contributor Author

While working on the test of the comment of persisting the actor task spec, I realized that 2 addition issues:

  1. After actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.
  2. With GCS fault tolerance, the actor cannot be killed by actor.kill() because the local Raylet address is not persisted/restored properly.

I'll investigate the issues and add the fix to this PR (if it is required by the test case) or come up with followup PRs on them.

@MengjinYan
Copy link
Contributor Author

While working on the test of the comment of persisting the actor task spec, I realized that 2 addition issues:

  1. After actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.
  2. With GCS fault tolerance, the actor cannot be killed by actor.kill() because the local Raylet address is not persisted/restored properly.

I'll investigate the issues and add the fix to this PR (if it is required by the test case) or come up with followup PRs on them.

With the investigation,

  1. The task doesn't showup because the events for the actor creation task retry doesn't contain the TaskInfo and the list_tasks API just filtered the entry from the task table. And the missing of the TaskInfo is mainly due to missing task events for the actor creation task scheduling process in GCS. The following up PR of adding more events will fix the issue.
  2. This issue can be workaround in test by killing the actor process directly. We should further investigate and fix the issue.

For this PR, I fixed the comment to persist the actor creation task spec after updating the attempt_number.

I also tested locally with the following test case:

def test_actor_creation_task_retry_with_gcs_restart(ray_start_regular_with_external_redis):
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class Actor:
        def __init__(self):
            pass

        def get_pid(self):
            return os.getpid()
    
    # Create an actor and make sure it is alive.
    actor = Actor.remote()
    pid_1 = ray.get(actor.get_pid.remote())

    # Kill the actor and checks the actor creation task is retried.
    actor_process_1 = psutil.Process(pid_1)
    actor_process_1.kill()
    actor_process_1.wait(timeout=10)
    pid_2 = ray.get(actor.get_pid.remote())
    assert pid_2 != pid_1

    # Kill and restart the GCS
    ray._private.worker._global_node.kill_gcs_server()
    ray._private.worker._global_node.start_gcs_server()

    # Check the actor works with the new GCS
    pid_3 = ray.get(actor.get_pid.remote())
    assert pid_3 == pid_2

    # Kill the actor and checks the actor creation task retry number is 2
    # ray.kill(actor, no_restart=False)
    # wait_for_condition(lambda: list_actors()[0].state == "RESTARTING")
    actor_process_2 = psutil.Process(pid_2)
    actor_process_2.kill()
    actor_process_2.wait(timeout=10)
    pid_4 = ray.get(actor.get_pid.remote())
    assert pid_4 != pid_2
    
    tasks = ray.util.state.list_tasks()
    print(tasks)

This is not added to the PR because the list task issue mentioned above needs to be fixed first. But with the local test, I verified by emitting logs about the raw task events and verified that, after the GCS restart, the restart of the actor will have attempt_number == 2 which is +1 of the attempt_number before the GCS restart. This means that attempt_number persists correctly.

[2025-11-24 12:39:53,998 I 98546 26382886] (gcs_server) gcs_task_manager.cc:479: [myan] Task event: {"taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","a
ttemptNumber":2,"taskInfo":{"type":"ACTOR_TASK","name":"Actor.get_pid","funcOrClassName":"Actor.get_pid","jobId":"AQAAAA==","taskId":"4NwXTINZkDTv+
D02fZFno4ilfd4BAAAA","parentTaskId":"//////////////////////////8BAAAA","requiredResources":{"CPU":1},"runtimeEnvInfo":{"serializedRuntimeEnv":"{}",
"runtimeEnvConfig":{"setupTimeoutSeconds":600,"eagerInstall":true}},"actorId":"7/g9Nn2RZ6OIpX3eAQAAAA=="},"stateUpdates":{"nodeId":"+o2EmR4Mt6foTRq
1pwJeYijQMhIvWLut2ebvEQ==","workerId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","workerPid":98607,"stateTsNs":{"1":"1764016791934715000","5":"1764
016792935842000","11":"1764016793027231000","7":"1764016792937619000","2":"1764016792935806000","8":"1764016792937639000"}},"profileEvents":{"compo
nentType":"worker","componentId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","nodeIpAddress":"127.0.0.1","events":[{"startTime":"1764016792938575000
","endTime":"1764016792938579000","extraData":"{}","eventName":"task:deserialize_arguments"},{"startTime":"1764016792938593000","endTime":"17640167
92938636000","extraData":"{}","eventName":"task:execute"},{"startTime":"1764016792938641000","endTime":"1764016793026486000","extraData":"{}","even
tName":"task:store_outputs"},{"startTime":"1764016792937790000","endTime":"1764016793026920000","extraData":"{\"name\": \"get_pid\", \"task_id\": \
"e0dc174c83599034eff83d367d9167a388a57dde01000000\"}","eventName":"task::Actor.get_pid"}]},"jobId":"AQAAAA=="}```

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 9, 2025
@edoakes
Copy link
Collaborator

edoakes commented Dec 9, 2025

not stale

@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 10, 2025
@MengjinYan MengjinYan marked this pull request as draft January 13, 2026 23:53
@MengjinYan MengjinYan marked this pull request as ready for review January 14, 2026 23:21
"RUNNING",
}
expected_task_id_states_dict = {
(actor_creation_task_id, 1): expected_actor_retry_task_states,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test expects incomplete state set causing potential flakiness

Low Severity

The validate_actor_restart function expects only {"RUNNING"} state for the restarted actor creation task, but check_task_lifecycle_event_states_and_error_info uses strict equality comparison (line 184 in context). When the actor's __init__ completes and FINISHED event arrives, the actual states become {"RUNNING", "FINISHED"}, which won't equal {"RUNNING"}. Since wait_for_condition retries and events accumulate in the httpserver log, the test may timeout if both states arrive before validation.

Fix in Cursor Fix in Web

Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple questions

assert event["task_definition_event"]["task_attempt"] == 0
assert (
event["task_definition_event"]["language"] == "PYTHON"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I think something like that could be more readable, but not blocking this for now

assert event["task_definition_event"]["task_attempt"] == 0
assert (
event["task_definition_event"]["language"] == "PYTHON"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test also seems pretty complicated for testing a simple change, is there another export task event test for actors and can testing this logic be a part of that test or extension to that test?

actor->UpdateState(rpc::ActorTableData::RESTARTING);
actor->GetMutableTaskSpec()->set_attempt_number(new_num_restarts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who reads num_restarts vs. who reads the attempt number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_restarts will be part of the actor lifecycle information. And the attempt_number is the concept with the actor creation task.

Whoever consumes the actor table information in GCS or the actor events will consume the num_restarts field. And whoever consumes the task events should consume the attempt_number field.

Hope that answers the question.

Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shipping but I think there's something here where the actor task spec table and actor table are modified separately and could have differing views when the gcs dies. If the gcs dies in the middle, we're gonna have diff info in the actor table and actor task spec table, and on restart...

@MengjinYan MengjinYan requested a review from dayshah January 16, 2026 19:32
@MengjinYan MengjinYan requested review from a team, aslonnie and edoakes as code owners January 21, 2026 19:42
@MengjinYan MengjinYan marked this pull request as draft January 21, 2026 19:43
MengjinYan and others added 5 commits January 21, 2026 12:25
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@MengjinYan MengjinYan marked this pull request as ready for review January 21, 2026 23:46
@aslonnie aslonnie removed request for a team and aslonnie January 22, 2026 07:11
@dayshah dayshah merged commit dfefb69 into ray-project:master Jan 22, 2026
6 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…r restart (ray-project#58877)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
…r restart (ray-project#58877)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: 400Ping <jiekaichang@apache.org>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…r restart (ray-project#58877)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…r restart (ray-project#58877)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants