
[core] Fix num retries left message #59829

Merged
edoakes merged 2 commits into ray-project:master from dayshah:fix-num-retries-msg
Jan 6, 2026

Conversation


@dayshah dayshah commented Jan 3, 2026

Description

Before, you would get a message that looks like:

```
Task Actor.f failed. There are 0 retries remaining, so the task will be retried. Error: The actor is temporarily unavailable: IOError: The actor was restarted
```

when a task with 1 retry got retried. It doesn't make sense for a task to be retried when there are "0 retries" remaining. This happens because we decrement the counter before pushing this message; the fix is to add 1 back in the message string. We also printed the same message for preemptions, which could be confusing (e.g. "0 retries remaining" yet still retrying, because a preempted node doesn't count against retries). The message now explicitly says that the retry is due to preemption and won't count against retries.

Also removes an unused Ray config, `raylet_fetch_timeout_milliseconds`.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from a team as a code owner January 3, 2026 23:55
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jan 3, 2026
const std::string num_retries_left_str =
    num_retries_left == -1 ? "infinite" : std::to_string(num_retries_left);
RAY_LOG(INFO) << "task " << spec.TaskId() << " retries left: " << num_retries_left_str
<< ", oom retries left: " << num_oom_retries_left
<< ", task failed due to oom: " << task_failed_due_to_oom;
@dayshah (author) commented:

Moving this log into the lock so the num_oom_retries_left and task_failed_due_to_oom can be refs and don't need to be created outside the scope.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a confusing log message about the number of retries left by displaying the count before the decrement. The change to use references for the retry counters is a good simplification. The removal of the unused raylet_fetch_timeout_milliseconds configuration is also a nice cleanup.

I have one suggestion in src/ray/core_worker/task_manager.cc to improve the readability and efficiency of the error message construction.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Jan 4, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
@edoakes edoakes merged commit 0870e5d into ray-project:master Jan 6, 2026
6 checks passed
@dayshah dayshah deleted the fix-num-retries-msg branch January 6, 2026 01:51
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
