[core] Fix num retries left message #59829
Merged
edoakes merged 2 commits into ray-project:master on Jan 6, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
dayshah commented on Jan 3, 2026
    num_retries_left == -1 ? "infinite" : std::to_string(num_retries_left);
RAY_LOG(INFO) << "task " << spec.TaskId() << " retries left: " << num_retries_left_str
              << ", oom retries left: " << num_oom_retries_left
              << ", task failed due to oom: " << task_failed_due_to_oom;
Moving this log inside the lock so that num_oom_retries_left and task_failed_due_to_oom can be references and don't need to be declared outside the locked scope.
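A minimal sketch of the pattern described in this comment, not the actual task_manager.cc code: once the log line is built while the lock is held, the counters can be read through references into the protected entry instead of being copied into variables declared before the lock. `TaskEntry`, `RetryTable`, and `DescribeRetries` are hypothetical names used only for illustration, and the sketch returns the line as a string rather than calling RAY_LOG.

```cpp
#include <cstdint>
#include <mutex>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a per-task entry guarded by a mutex.
struct TaskEntry {
  int32_t num_retries_left;      // -1 means unlimited retries
  int32_t num_oom_retries_left;
};

class RetryTable {
 public:
  void Add(const std::string &task_id, TaskEntry entry) {
    std::lock_guard<std::mutex> lock(mu_);
    entries_[task_id] = entry;
  }

  // Builds the log line while the lock is held, so the counters can be
  // bound as references instead of being copied out of the locked scope.
  std::string DescribeRetries(const std::string &task_id, bool task_failed_due_to_oom) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = entries_.find(task_id);
    if (it == entries_.end()) {
      return "task " + task_id + " not found";
    }
    const int32_t &num_retries_left = it->second.num_retries_left;
    const int32_t &num_oom_retries_left = it->second.num_oom_retries_left;
    const std::string num_retries_left_str =
        num_retries_left == -1 ? "infinite" : std::to_string(num_retries_left);
    std::ostringstream os;
    os << "task " << task_id << " retries left: " << num_retries_left_str
       << ", oom retries left: " << num_oom_retries_left
       << ", task failed due to oom: " << task_failed_due_to_oom;
    return os.str();
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, TaskEntry> entries_;
};
```

The references are only valid while the lock is held, which is why the string is fully built before the function returns.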
Code Review
This pull request correctly fixes a confusing log message about the number of retries left by displaying the count before the decrement. The change to use references for the retry counters is a good simplification. The removal of the unused raylet_fetch_timeout_milliseconds configuration is also a nice cleanup.
I have one suggestion in src/ray/core_worker/task_manager.cc to improve the readability and efficiency of the error message construction.
edoakes approved these changes on Jan 6, 2026
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request on Jan 13, 2026
## Description
Before you would get a message that looks like:
```
Task Actor.f failed. There are 0 retries remaining, so the task will be retried. Error: The actor is temporarily unavailable: IOError: The actor was restarted
```
when a task with 1 retry gets retried. Doesn't make sense to get retried when there's "0 retries". This is because we decrement before pushing this message. Fixing this with a +1 in the msg str. Also we'd print the same thing for preemptions which could possibly not make sense, e.g. 0 retries remaining but still retrying when node is preempted because it doesn't count against retries. So now it explicitly says that this retry is because of preemption and it won't count against retries.

Also removing an unused ray config - `raylet_fetch_timeout_milliseconds`.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request on Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request on Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
Description
Before this change, a task with 1 retry that got retried would produce a message like:

    Task Actor.f failed. There are 0 retries remaining, so the task will be retried. Error: The actor is temporarily unavailable: IOError: The actor was restarted

It doesn't make sense for a task to be retried when there are "0 retries" remaining. This happens because we decrement the retry counter before pushing this message. Fixing this with a +1 in the message string.
We'd also print the same message for preemptions, where it can be equally confusing: e.g. 0 retries remaining but the task is still retried, because a retry caused by node preemption doesn't count against retries. The message now explicitly says that the retry is due to preemption and won't count against retries.
Also removing an unused Ray config, `raylet_fetch_timeout_milliseconds`.
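To make the off-by-one concrete, here is a rough sketch of the message logic described above. It assumes the counter has already been decremented when the message is built, that -1 means unlimited retries, and that a preemption retry leaves the counter untouched. `RetriesLeftToString` and `BuildRetryMessage` are hypothetical helpers, and the preemption wording is illustrative rather than the exact string added by this PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: -1 is treated as "unlimited retries".
std::string RetriesLeftToString(int32_t num_retries_left) {
  return num_retries_left == -1 ? "infinite" : std::to_string(num_retries_left);
}

// Hypothetical sketch of the retry message. num_retries_left is the counter
// value at the time the message is built, i.e. after the decrement for a
// normal retry.
std::string BuildRetryMessage(const std::string &task_name,
                              int32_t num_retries_left,
                              bool node_preempted) {
  std::string msg = "Task " + task_name + " failed. ";
  if (node_preempted) {
    // Preemption retries do not consume the retry budget, so no count is shown.
    msg += "The node was preempted, so the task will be retried without "
           "counting against its retries.";
    return msg;
  }
  // Undo the earlier decrement so the message shows the count before this retry.
  const int32_t retries_before_this_retry =
      num_retries_left == -1 ? -1 : num_retries_left + 1;
  msg += "There are " + RetriesLeftToString(retries_before_this_retry) +
         " retries remaining, so the task will be retried.";
  return msg;
}
```

With these assumptions, a task configured with 1 retry reaches this code with the counter at 0 and reports 1 retry remaining instead of 0.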