[Core] Update the attempt number of the actor creation task when actor restart by MengjinYan · Pull Request #58877 · ray-project/ray

MengjinYan · 2025-11-21T07:09:17Z

Description

Recently, we realized that during actor restarts, we will generate duplicate task events for the same actor creation task id and attempt 0. This is because, the actor restart process happens inside GCS and the corresponding actor creation task TaskSpec won't be updated with the new attempt number.

This PR adds the logic to update the attempt number to the actor creation task TaskSpec and add a test to verify the change.

Related issues

N/A

Additional information

N/A

MengjinYan · 2025-11-21T07:28:24Z

Another issue we found during the investigation is that, because part of the actor creation process is done on GCS, the task events won't be sent for some of the task state transitions (SUBMITTED_TO_WORKER and PENDING_NODE_ASSIGNMENT for restart). I'm planning to create a followup PR to address the issue to expose the missing task events for the actor creation task.

src/ray/gcs/gcs_actor_manager.cc

dayshah · 2025-11-22T03:35:17Z

python/ray/tests/test_ray_event_export_task_events.py

+                            assert event["task_definition_event"]["task_attempt"] == 0
+                            assert (
+                                event["task_definition_event"]["language"] == "PYTHON"
+                            )


is it possible to just make the expected event and do event["task_definition_event"] = expected_event to make it a little cleaner

Not really. Because event fields like event_id is not deterministic.

I think one solution I can try is to extract all the deterministic fields and compare them with generated expected_event with those fields. Do you think it will be better?

Ya I think something like that could be more readable, but not blocking this for now

The test also seems pretty complicated for testing a simple change, is there another export task event test for actors and can testing this logic be a part of that test or extension to that test?

The added test is to add one more scenario for actor/task execution so a whole new test so I think it is inevitable. At the same time, I can have followup PR to make the tests more readable.

dayshah · 2025-11-22T03:38:09Z

python/ray/tests/test_ray_event_export_task_events.py

+
+        # Add a second node to the cluster for the actor to be restarted on
+        cluster.add_node(num_cpus=2)
+        cluster.wait_for_nodes()


add_node has wait: bool = True anyways

dayshah · 2025-11-22T03:39:46Z

python/ray/tests/test_ray_event_export_task_events.py

+
+            driver_task_definition_received = False
+            actor_creation_task_definition_received = False
+            for event in events:


is the order of events deterministic?

Got it, thanks!

Not really as well. The order of the events from the same worker process should be deterministic but not across processes. And events from different processes can be interleaved with each other.

israbbani · 2025-11-24T17:19:11Z

python/ray/tests/test_ray_event_export_task_events.py

+        script = """
+import ray
+import time
+ray.init()
+
+@ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
+class Actor:
+    def __init__(self):
+        pass
+
+    def actor_task(self):
+        pass
+
+actor = Actor.remote()
+time.sleep(999) # Keep the actor alive
+        """


I like to use textwrap.dedent to make this readable.

israbbani · 2025-11-24T19:11:04Z

python/ray/tests/test_ray_event_export_task_events.py

            script, httpserver, ray_start_cluster_head_with_env_vars, validate_events
        )

+    @_cluster_with_aggregator_target


I'm looking at the fixture while reviewing this and I think the env_var_prefix makes it hard to find RAY_DASHBOARD_AGGREGATOR_AGENT_PRESERVE_PROTO_FIELD_NAME config. This is one of the few cases in which duplication is probably better.

Good point. Discussed offline. We should fix that in a separate PR.

MengjinYan · 2025-11-24T19:50:01Z

While working on the test of the comment of persisting the actor task spec, I realized that 2 addition issues:

After actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.
With GCS fault tolerance, the actor cannot be killed by actor.kill() because the local Raylet address is not persisted/restored properly.

I'll investigate the issues and add the fix to this PR (if it is required by the test case) or come up with followup PRs on them.

MengjinYan · 2025-11-25T01:30:55Z

While working on the test of the comment of persisting the actor task spec, I realized that 2 addition issues:

After actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.

With GCS fault tolerance, the actor cannot be killed by actor.kill() because the local Raylet address is not persisted/restored properly.

I'll investigate the issues and add the fix to this PR (if it is required by the test case) or come up with followup PRs on them.

With the investigation,

The task doesn't showup because the events for the actor creation task retry doesn't contain the TaskInfo and the list_tasks API just filtered the entry from the task table. And the missing of the TaskInfo is mainly due to missing task events for the actor creation task scheduling process in GCS. The following up PR of adding more events will fix the issue.
This issue can be workaround in test by killing the actor process directly. We should further investigate and fix the issue.

For this PR, I fixed the comment to persist the actor creation task spec after updating the attempt_number.

I also tested locally with the following test case:

def test_actor_creation_task_retry_with_gcs_restart(ray_start_regular_with_external_redis):
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class Actor:
        def __init__(self):
            pass

        def get_pid(self):
            return os.getpid()
    
    # Create an actor and make sure it is alive.
    actor = Actor.remote()
    pid_1 = ray.get(actor.get_pid.remote())

    # Kill the actor and checks the actor creation task is retried.
    actor_process_1 = psutil.Process(pid_1)
    actor_process_1.kill()
    actor_process_1.wait(timeout=10)
    pid_2 = ray.get(actor.get_pid.remote())
    assert pid_2 != pid_1

    # Kill and restart the GCS
    ray._private.worker._global_node.kill_gcs_server()
    ray._private.worker._global_node.start_gcs_server()

    # Check the actor works with the new GCS
    pid_3 = ray.get(actor.get_pid.remote())
    assert pid_3 == pid_2

    # Kill the actor and checks the actor creation task retry number is 2
    # ray.kill(actor, no_restart=False)
    # wait_for_condition(lambda: list_actors()[0].state == "RESTARTING")
    actor_process_2 = psutil.Process(pid_2)
    actor_process_2.kill()
    actor_process_2.wait(timeout=10)
    pid_4 = ray.get(actor.get_pid.remote())
    assert pid_4 != pid_2
    
    tasks = ray.util.state.list_tasks()
    print(tasks)

This is not added to the PR because the list task issue mentioned above needs to be fixed first. But with the local test, I verified by emitting logs about the raw task events and verified that, after the GCS restart, the restart of the actor will have attempt_number == 2 which is +1 of the attempt_number before the GCS restart. This means that attempt_number persists correctly.

[2025-11-24 12:39:53,998 I 98546 26382886] (gcs_server) gcs_task_manager.cc:479: [myan] Task event: {"taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","a
ttemptNumber":2,"taskInfo":{"type":"ACTOR_TASK","name":"Actor.get_pid","funcOrClassName":"Actor.get_pid","jobId":"AQAAAA==","taskId":"4NwXTINZkDTv+
D02fZFno4ilfd4BAAAA","parentTaskId":"//////////////////////////8BAAAA","requiredResources":{"CPU":1},"runtimeEnvInfo":{"serializedRuntimeEnv":"{}",
"runtimeEnvConfig":{"setupTimeoutSeconds":600,"eagerInstall":true}},"actorId":"7/g9Nn2RZ6OIpX3eAQAAAA=="},"stateUpdates":{"nodeId":"+o2EmR4Mt6foTRq
1pwJeYijQMhIvWLut2ebvEQ==","workerId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","workerPid":98607,"stateTsNs":{"1":"1764016791934715000","5":"1764
016792935842000","11":"1764016793027231000","7":"1764016792937619000","2":"1764016792935806000","8":"1764016792937639000"}},"profileEvents":{"compo
nentType":"worker","componentId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","nodeIpAddress":"127.0.0.1","events":[{"startTime":"1764016792938575000
","endTime":"1764016792938579000","extraData":"{}","eventName":"task:deserialize_arguments"},{"startTime":"1764016792938593000","endTime":"17640167
92938636000","extraData":"{}","eventName":"task:execute"},{"startTime":"1764016792938641000","endTime":"1764016793026486000","extraData":"{}","even
tName":"task:store_outputs"},{"startTime":"1764016792937790000","endTime":"1764016793026920000","extraData":"{\"name\": \"get_pid\", \"task_id\": \
"e0dc174c83599034eff83d367d9167a388a57dde01000000\"}","eventName":"task::Actor.get_pid"}]},"jobId":"AQAAAA=="}```

github-actions · 2025-12-09T12:25:51Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

edoakes · 2025-12-09T15:22:55Z

not stale

python/ray/tests/test_ray_event_export_task_events.py

cursor · 2026-01-14T23:34:08Z

python/ray/tests/test_ray_event_export_task_events.py

+                "RUNNING",
+            }
+            expected_task_id_states_dict = {
+                (actor_creation_task_id, 1): expected_actor_retry_task_states,


Test expects incomplete state set causing potential flakiness

Low Severity

The validate_actor_restart function expects only {"RUNNING"} state for the restarted actor creation task, but check_task_lifecycle_event_states_and_error_info uses strict equality comparison (line 184 in context). When the actor's __init__ completes and FINISHED event arrives, the actual states become {"RUNNING", "FINISHED"}, which won't equal {"RUNNING"}. Since wait_for_condition retries and events accumulate in the httpserver log, the test may timeout if both states arrive before validation.

dayshah

couple questions

dayshah · 2026-01-16T00:01:29Z

python/ray/tests/test_ray_event_export_task_events.py

+                            assert event["task_definition_event"]["task_attempt"] == 0
+                            assert (
+                                event["task_definition_event"]["language"] == "PYTHON"
+                            )


Ya I think something like that could be more readable, but not blocking this for now

dayshah · 2026-01-16T00:08:09Z

python/ray/tests/test_ray_event_export_task_events.py

+                            assert event["task_definition_event"]["task_attempt"] == 0
+                            assert (
+                                event["task_definition_event"]["language"] == "PYTHON"
+                            )


The test also seems pretty complicated for testing a simple change, is there another export task event test for actors and can testing this logic be a part of that test or extension to that test?

dayshah · 2026-01-16T00:09:52Z

src/ray/gcs/actor/gcs_actor_manager.cc

    actor->UpdateState(rpc::ActorTableData::RESTARTING);
+    actor->GetMutableTaskSpec()->set_attempt_number(new_num_restarts);


Who reads num_restarts vs. who reads the attempt number?

num_restarts will be part of the actor lifecycle information. And the attempt_number is the concept with the actor creation task.

Whoever consumes the actor table information in GCS or the actor events will consume the num_restarts field. And whoever consumes the task events should consume the attempt_number field.

Hope that answers the question.

dayshah

Shipping but I think there's something here where the actor task spec table and actor table are modified separately and could have differing views when the gcs dies. If the gcs dies in the middle, we're gonna have diff info in the actor table and actor task spec table, and on restart...

src/ray/gcs/gcs_actor_manager.cc

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

…r restart (ray-project#58877) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>

…r restart (ray-project#58877) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: 400Ping <jiekaichang@apache.org>

…r restart (ray-project#58877) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>

MengjinYan added the go add ONLY when ready to merge, run all tests label Nov 21, 2025

MengjinYan marked this pull request as ready for review November 21, 2025 21:14

MengjinYan requested a review from a team as a code owner November 21, 2025 21:14

cursor bot reviewed Nov 21, 2025

View reviewed changes

src/ray/gcs/gcs_actor_manager.cc Outdated Show resolved Hide resolved

ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Nov 22, 2025

dayshah reviewed Nov 22, 2025

View reviewed changes

israbbani reviewed Nov 24, 2025

View reviewed changes

github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 9, 2025

github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 10, 2025

cursor bot reviewed Jan 12, 2026

View reviewed changes

python/ray/tests/test_ray_event_export_task_events.py Outdated Show resolved Hide resolved

MengjinYan marked this pull request as draft January 13, 2026 23:53

aslonnie mentioned this pull request Jan 14, 2026

[java] upgrade log4j to 2.25.3 #60141

Merged

MengjinYan marked this pull request as ready for review January 14, 2026 23:21

cursor bot reviewed Jan 14, 2026

View reviewed changes

dayshah reviewed Jan 16, 2026

View reviewed changes

dayshah approved these changes Jan 16, 2026

View reviewed changes

src/ray/gcs/gcs_actor_manager.cc Outdated Show resolved Hide resolved

MengjinYan requested a review from dayshah January 16, 2026 19:32

MengjinYan requested review from a team, matthewdeng and richardliaw as code owners January 21, 2026 19:42

MengjinYan requested review from a team, aslonnie and edoakes as code owners January 21, 2026 19:42

MengjinYan marked this pull request as draft January 21, 2026 19:43

MengjinYan force-pushed the core-2546 branch from 3ff7a12 to bd434b1 Compare January 21, 2026 19:46

MengjinYan and others added 5 commits January 21, 2026 12:25

update the attempt number of the actor creation task when actor restart

e791a15

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix comments

f0e4855

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix actor manager test failure

c851ab3

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

Update src/ray/gcs/gcs_actor_manager.cc

fd23b02

Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

fix build failure

768385d

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

MengjinYan force-pushed the core-2546 branch from bd434b1 to 768385d Compare January 21, 2026 22:28

Merge branch 'master' into core-2546

5b702c5

MengjinYan marked this pull request as ready for review January 21, 2026 23:46

aslonnie removed request for a team and aslonnie January 22, 2026 07:11

MengjinYan removed request for matthewdeng and richardliaw January 22, 2026 17:32

dayshah approved these changes Jan 22, 2026

View reviewed changes

dayshah merged commit dfefb69 into ray-project:master Jan 22, 2026
6 checks passed

		actor->UpdateState(rpc::ActorTableData::RESTARTING);
		actor->GetMutableTaskSpec()->set_attempt_number(new_num_restarts);

Conversation

MengjinYan commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Additional information

Uh oh!

MengjinYan commented Nov 21, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MengjinYan commented Nov 24, 2025

Uh oh!

MengjinYan commented Nov 25, 2025

Uh oh!

github-actions bot commented Dec 9, 2025

Uh oh!

edoakes commented Dec 9, 2025

Uh oh!

Uh oh!

cursor bot Jan 14, 2026

Choose a reason for hiding this comment

Test expects incomplete state set causing potential flakiness

Uh oh!

dayshah left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dayshah left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

MengjinYan commented Nov 21, 2025 •

edited

Loading