[core][rdt] Atomically send/recv for two-sided ordering #60202
Merged
dayshah merged 1 commit into ray-project:master on Jan 16, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request addresses a critical race condition in trigger_out_of_band_tensor_transfer which could cause deadlocks when using two-sided communication backends like NCCL. The fix correctly ensures atomicity of send and receive task submissions by placing them within a lock. My review includes a suggestion to refine the locking strategy to minimize the lock-holding duration, which could improve performance and reduce the risk of deadlocks, while still maintaining the fix's correctness.
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request on Jan 18, 2026
…60202) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request on Jan 20, 2026
…60202) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request on Feb 3, 2026
…60202) Signed-off-by: dayshah <dhyey2019@gmail.com>
sampan-s-nayak added a commit that referenced this pull request on Feb 6, 2026
…)" This reverts commit 1fd6d11.
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
…60202) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
#59610 caused rdt release test failures when using NCCL. The cause is that the main user Python thread and the io context thread can both call `trigger_out_of_band_tensor_transfer` at the same time. For two-sided communication, matching sends and recvs must be in the same order on the two sets of actors. If `trigger_out_of_band_tensor_transfer` is called on two threads, the sends and recvs from the two calls can interleave, so that their order on the sending actors differs from their order on the receiving actors. This hangs forever, because send1 and recv1 need to be executing at the same time to unblock each other.

This PR fixes it by placing a lock on the task submission, so that each send and its matching recv are guaranteed to always be submitted together.
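
To illustrate the idea, here is a minimal sketch of locking a paired submission. It is not Ray's actual code: `TransferSubmitter`, `submit_transfer`, `sender_queue`, `receiver_queue`, and `run_demo` are hypothetical names, and plain Python lists stand in for the per-actor task queues.

```python
import threading


class TransferSubmitter:
    """Toy stand-in for the task-submission path (hypothetical, not Ray's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.sender_queue = []    # order in which sends reach the source actor
        self.receiver_queue = []  # order in which recvs reach the destination actor

    def submit_transfer(self, transfer_id):
        # Hold the lock across both submissions so that no other thread can
        # slip its own send or recv in between this send/recv pair.
        with self._lock:
            self.sender_queue.append(("send", transfer_id))
            self.receiver_queue.append(("recv", transfer_id))


def run_demo(num_threads=8, transfers_per_thread=200):
    """Submit transfers from several threads and return the submitter."""
    submitter = TransferSubmitter()

    def worker(thread_index):
        for i in range(transfers_per_thread):
            submitter.submit_transfer((thread_index, i))

    threads = [
        threading.Thread(target=worker, args=(t,)) for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return submitter


if __name__ == "__main__":
    s = run_demo()
    send_order = [tid for _, tid in s.sender_queue]
    recv_order = [tid for _, tid in s.receiver_queue]
    # With the lock held across the pair, both queues see the same order.
    print(send_order == recv_order)
```

Without the lock, one possible interleaving in this toy model is: thread A appends send1, thread B appends send2 and recv2, then thread A appends recv1. The sender queue would then hold send1 before send2 while the receiver queue holds recv2 before recv1, which is the kind of mismatched ordering that blocks a two-sided backend.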