Skip to content

[core][rdt] Atomically send/recv for two-sided ordering#60202

Merged
dayshah merged 1 commit intoray-project:masterfrom
dayshah:fix-rdt-collective
Jan 16, 2026
Merged

[core][rdt] Atomically send/recv for two-sided ordering#60202
dayshah merged 1 commit intoray-project:masterfrom
dayshah:fix-rdt-collective

Conversation

@dayshah
Copy link
Contributor

@dayshah dayshah commented Jan 16, 2026

Description

This #59610 caused rdt release test failures when using nccl. This is because you could end up in a situation where both the main user python thread and io context thread are both calling trigger_out_of_band_tensor_transfer at the same time. For two-sided communication we need matching sends and recvs to be in the same order on two sets of actors. If trigger_out_of_band_tensor_transfer is being called on two threads, you could have something where you

  1. submit send1 on A
  2. submit send2 on A
  3. submit recv2 on B
  4. submit recv1 on B

This will cause it to hang forever because you need send1 and recv1 to be executing at the same time to unblock each other.

Fixing it by placing a lock on the task submission so they're guaranteed to always be submitted together.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from a team as a code owner January 16, 2026 02:38
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jan 16, 2026
@dayshah dayshah changed the title [core][rdt] Atomically send/recv for two-sided for ordering [core][rdt] Atomically send/recv for two-sided ordering Jan 16, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical race condition in trigger_out_of_band_tensor_transfer which could cause deadlocks when using two-sided communication backends like NCCL. The fix correctly ensures atomicity of send and receive task submissions by placing them within a lock. My review includes a suggestion to refine the locking strategy to minimize the lock-holding duration, which could improve performance and reduce the risk of deadlocks, while still maintaining the fix's correctness.

@dayshah dayshah requested a review from Sparks0219 January 16, 2026 03:02
Copy link
Contributor

@Sparks0219 Sparks0219 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚢

@dayshah dayshah merged commit 1fd6d11 into ray-project:master Jan 16, 2026
5 checks passed
@dayshah dayshah deleted the fix-rdt-collective branch January 16, 2026 04:45
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
…60202)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
…60202)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
sampan-s-nayak added a commit that referenced this pull request Feb 6, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…60202)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…60202)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants