
[data] Add node_id, pid, attempt # for hanging tasks #59793

Merged
alexeykudinkin merged 11 commits into ray-project:master from iamjustinhsu:jhsu/add-more-info-to-hanging-tasks on Jan 9, 2026

Conversation

@iamjustinhsu
Contributor

iamjustinhsu commented Dec 31, 2025

Description

Currently, when displaying hanging tasks, we only show the Ray Data-level task index, which isn't useful for Ray Core debugging. This PR adds more information to long-running tasks, namely:

  • node_id
  • pid
  • attempt #

I did consider adding this to the high-memory detector, but decided against it for two reasons:

  • it requires more refactoring of RunningTaskInfo
  • as far as I know, it's not helpful for debugging, since high memory is reported after the task completes

Example script to trigger hanging issues

```python
import time

import ray
from ray.data._internal.issue_detection.detectors import (
    HangingExecutionIssueDetectorConfig,
)

# Shorten the detection interval so the hanging-task detector fires quickly.
ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)


def sleep(x):
    # With one row per block, exactly one task (the block containing id 0)
    # sleeps far longer than the detection interval and gets flagged as hanging.
    if x["id"] == 0:
        time.sleep(100)
    return x


ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()
```

Related issues

None

Additional information

None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
iamjustinhsu requested a review from a team as a code owner December 31, 2025 21:56
Contributor

gemini-code-assist bot left a comment


Code Review

This pull request aims to enhance the debuggability of hanging tasks by incorporating node_id, pid, and attempt # into the hanging task detector's output. This is achieved by passing the task_id through the operator pipeline to OpRuntimeMetrics, which is then used by the HangingExecutionIssueDetector to fetch detailed task information. The implementation is generally sound, with a beneficial refactoring in physical_operator.py. However, I've identified a critical issue in hash_shuffle.py where arguments to on_task_submitted are incorrectly ordered, which would result in a runtime error.
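For readers unfamiliar with this failure mode, here is a minimal hypothetical sketch (the names and signature are illustrative only, not the actual hash_shuffle.py or OpRuntimeMetrics code) of how swapped positional arguments surface as a runtime error:

```python
# Hypothetical illustration only; the real on_task_submitted signature may differ.
class OpMetrics:
    def on_task_submitted(self, task_index: int, task_id: bytes) -> None:
        # An int has no .hex() method, so a caller that swaps the two
        # positional arguments fails here with an AttributeError at runtime.
        print(f"task {task_index} submitted with id {task_id.hex()}")


metrics = OpMetrics()
metrics.on_task_submitted(0, b"\xde\xad")    # correct argument order
# metrics.on_task_submitted(b"\xde\xad", 0)  # swapped order: AttributeError
```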

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
ray-gardener bot added the data (Ray Data-related issues) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Jan 1, 2026
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>


```python
@dataclass
class RunningTaskInfo:
```
Contributor


I assume this state is not serialized and persisted anywhere, correct?

Contributor Author


nah it's not
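For context, a hedged sketch of the change described in the review summary (field names assumed, not the exact diff): the PR threads the Ray Core task ID into this per-task record so the hanging detector can later resolve node_id, pid, and attempt number via the state API.

```python
# Sketch only; the actual RunningTaskInfo in Ray Data carries more fields.
# The key addition described in the review summary is the Ray Core task_id.
from dataclasses import dataclass

import ray


@dataclass
class RunningTaskInfo:
    task_id: ray.TaskID  # resolved later via ray.util.state.get_task(...)
```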

Comment on lines +159 to +162
```python
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0,
)
```
Contributor


Can we pass in _explain=True and log the explanation in the event of a failure?
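A hedged sketch of that suggestion, reusing `task_info` from the snippet above (this is not the final merged code):

```python
import logging

import ray.util.state

logger = logging.getLogger(__name__)

# _explain=True asks the state API to print an explanation when the query
# fails or returns partial results, which helps diagnose lookup misses.
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0,
    _explain=True,
)
if task_state is None:
    logger.warning(
        "State API returned no state for task %s", task_info.task_id.hex()
    )
```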

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
iamjustinhsu added the go (add ONLY when ready to merge, run all tests) label Jan 7, 2026
iamjustinhsu and others added 4 commits January 7, 2026 15:18
…tector.py

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
alexeykudinkin merged commit a28405f into ray-project:master Jan 9, 2026
6 checks passed
iamjustinhsu deleted the jhsu/add-more-info-to-hanging-tasks branch January 9, 2026 19:06
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
bveeramani pushed a commit that referenced this pull request Feb 3, 2026
## Description
Previously, I added `task_id`, `node_id`, and `attempt_number` for hanging tasks in #59793. However, this introduced a race condition when querying for task state:
1. A task is submitted.
2. The issue detector immediately fires.
3. `get_task` returns `None` because the task state isn't ready yet:
https://github.com/iamjustinhsu/ray/blob/75f9731f69f4b9c7b973f53b74d0580adb3c4ab9/python/ray/data/_internal/issue_detection/detectors/hanging_detector.py#L161

For 2), we only fire when the task wasn't hanging before, or when the task has produced bytes since the last check. My fix is to _also_ re-query the state when `previous_state.task_state` is `None`.

I ran this many times, and the race condition stopped reproducing. Open to ideas on how to test this, too.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
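
A minimal, self-contained sketch of the re-query condition described above (helper and parameter names are assumed from the prose, not the actual hanging_detector.py code):

```python
from typing import Optional


def should_query_task_state(
    was_hanging_before: bool,
    produced_bytes_since_last_check: bool,
    previous_task_state: Optional[object],
) -> bool:
    """Decide whether the detector should (re-)fetch state for a task."""
    return (
        not was_hanging_before
        or produced_bytes_since_last_check
        # The fix: an earlier get_task call may have raced with task
        # submission and returned None, so retry in that case as well.
        or previous_task_state is None
    )
```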
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
rayhhome pushed a commit to rayhhome/ray that referenced this pull request Feb 4, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling


3 participants