# [data] Add node_id, pid, attempt # for hanging tasks (#59793)

alexeykudinkin merged 11 commits into `ray-project:master`
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Code Review
This pull request aims to enhance the debuggability of hanging tasks by incorporating node_id, pid, and attempt # into the hanging task detector's output. This is achieved by passing the task_id through the operator pipeline to OpRuntimeMetrics, which is then used by the HangingExecutionIssueDetector to fetch detailed task information. The implementation is generally sound, with a beneficial refactoring in physical_operator.py. However, I've identified a critical issue in hash_shuffle.py where arguments to on_task_submitted are incorrectly ordered, which would result in a runtime error.
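The flow the review describes (operator reports each submitted task's `task_id` to `OpRuntimeMetrics`, keyed by the Ray Data task index) can be sketched as follows. This is a hypothetical illustration with invented names, not the real Ray Data API; it only shows why the positional argument order of `on_task_submitted` matters:

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical sketch modeled on the review summary: the operator records
# each submitted task's Ray Core TaskID, keyed by the Ray Data task index,
# so the hanging detector can later fetch node_id / pid / attempt number.
@dataclass
class OpRuntimeMetricsSketch:
    running_tasks: Dict[int, str] = field(default_factory=dict)

    def on_task_submitted(self, task_index: int, task_id: str) -> None:
        self.running_tasks[task_index] = task_id


metrics = OpRuntimeMetricsSketch()
# Correct argument order: (task_index, task_id). Swapping the positional
# arguments, as the review flags for hash_shuffle.py, would key the map by
# the TaskID string and break later lookups by integer index.
metrics.on_task_submitted(0, "16310a0f0a45af5c")
```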
python/ray/data/_internal/issue_detection/detectors/hanging_detector.py
```python
@dataclass
class RunningTaskInfo:
```
I assume this state is not serialized and persisted anywhere, correct?
```python
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0,
)
```
Can we pass in `_explain=True` and log the explanation in the event of a failure?
## Description
Currently, when displaying hanging tasks, we show the Ray Data level task
index, which is not useful for Ray Core debugging. This PR adds more info
to long-running tasks, namely:
- node_id
- pid
- attempt #

I considered adding this to the high-memory detector as well, but avoided
it for two reasons:
- it requires a larger refactor of `RunningTaskInfo`
- as far as I know, it would not help with debugging, since high memory is
reported _after the task completes_
## Example script to trigger hanging issues
```python
import time

import ray
from ray.data._internal.issue_detection.detectors import (
    HangingExecutionIssueDetectorConfig,
)

# Lower the detection interval so the hanging task is reported quickly.
ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)


def sleep(x):
    # Make a single task hang far longer than the detection interval.
    if x["id"] == 0:
        time.sleep(100)
    return x


ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()
```
## Related issues
None
## Additional information
None
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
## Description
Previously, I added `task_id`, `node_id`, and `attempt_number` for hanging tasks in #59793. However, this introduced a race condition when querying for task state:
1. Task is submitted.
2. Issue detector immediately fires.
3. `get_task` returns `None` because the task state is not ready yet (https://github.com/iamjustinhsu/ray/blob/75f9731f69f4b9c7b973f53b74d0580adb3c4ab9/python/ray/data/_internal/issue_detection/detectors/hanging_detector.py#L161).

For step 2, we only query the state API when the task wasn't hanging before, or when the task has produced bytes since the last check. My fix is to _also_ re-query when `previous_state.task_state` is `None`. I ran this many times, and the race condition stopped. Open to ideas on testing this too.

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
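The re-query condition described above can be sketched as a small predicate. This is a minimal, hypothetical illustration (invented names, not the real `hanging_detector.py` code): refresh when the task is newly hanging, when it has produced bytes since the last check, or, per the fix, when the previously cached task state is still `None` because the earlier `get_task` call raced the submission:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical cached per-task state, modeled on the PR description.
@dataclass
class PreviousState:
    bytes_produced: int
    task_state: Optional[dict]  # None if the earlier get_task lookup raced


def should_refresh(prev: Optional[PreviousState], bytes_produced: int) -> bool:
    if prev is None:
        # Task wasn't flagged as hanging before.
        return True
    if bytes_produced > prev.bytes_produced:
        # Task has produced bytes since the last check.
        return True
    # The fix: also re-query when the cached task state never resolved.
    return prev.task_state is None


# Newly hanging task: refresh.
assert should_refresh(None, 0)
# Earlier get_task returned None: refresh again (the fix).
assert should_refresh(PreviousState(0, None), 0)
# State already resolved and no new bytes: skip the query.
assert not should_refresh(PreviousState(0, {"node_id": "n1"}), 0)
```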