Conversation

Review updated until commit 7283aa8

Description

| Relevant files | |
|---|---|
| Enhancement | 5 files |
| Configuration changes | |
| Documentation | 1 file |
| Tests | 1 file |
PR Reviewer Guide

Here are some key observations to aid the review process:

- 🧪 PR contains tests
- ⚡ Recommended focus areas for review

Spin loop performance concern

This can cause high CPU usage and may not be optimal for latency-sensitive workloads. Consider whether this should yield to other threads or use a more efficient waiting mechanism.
samnordmann
left a comment
Thank you very much! This looks great.
Here are some comments requesting some minor changes or explanation. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.
```cpp
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
```
This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make it non-blocking for the CPU -- and ideally CUDA-graph capturable?
Definitely important. I suggest we leave it for another PR, to keep this one simple and not too big.
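One possible direction for the non-blocking part, sketched with a generic completion callback rather than the real NIXL status API (so the names here are assumptions): return a `std::future` and poll the transfer status on a helper thread, leaving the caller's CPU thread free until it actually needs the result. CUDA-graph capture would additionally need a device-side wait (e.g. on a flag written at completion), which this sketch does not cover.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Poll a completion predicate on a helper thread instead of blocking the
// caller; the caller keeps its CPU until it calls .get()/.wait() on the future.
std::future<void> waitTransferAsync(std::function<bool()> isComplete) {
  return std::async(std::launch::async, [poll = std::move(isComplete)] {
    while (!poll()) {
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
}
```

A real implementation would poll the backend's transfer status and propagate errors, and a single dedicated progress thread (rather than one `std::async` task per transfer) would scale better.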
@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.
https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029
Note the build error in the CI.
Greptile Summary

This PR adds a NIXL (NVIDIA Inference Xfer Library) backend to nvFuser's multidevice infrastructure, enabling high-performance one-sided RDMA transfers between GPU tensors across ranks using UCX. The implementation introduces a `NixlBackend` class.

Key changes:
Two issues were found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant R0 as Rank 0
    participant R1 as Rank 1
    participant Store as TCPStore
    participant NIXL as NIXL/UCX
    R0->>R0: NixlBackend::getInstance()<br/>(create nixlAgent + UCX backend)
    R1->>R1: NixlBackend::getInstance()<br/>(create nixlAgent + UCX backend)
    R0->>NIXL: registerMem(src, dst)
    R1->>NIXL: registerMem(src, dst)
    Note over R0,R1: exchangeMetadata() — collective
    R0->>Store: set("nixl_agent_md_rank_0", localMD)
    R1->>Store: set("nixl_agent_md_rank_1", localMD)
    R0->>Store: get("nixl_agent_md_rank_1")
    R1->>Store: get("nixl_agent_md_rank_0")
    R0->>NIXL: loadRemoteMD(rank1_md)
    R1->>NIXL: loadRemoteMD(rank0_md)
    R0-->>R1: barrier()
    R0->>Store: deleteKey("nixl_agent_md_rank_0")
    R1->>Store: deleteKey("nixl_agent_md_rank_1")
    R0->>NIXL: createXferReq(READ, local_dst, remote_src, "rank_1")
    R1->>NIXL: createXferReq(READ, local_dst, remote_src, "rank_0")
    R0->>NIXL: postXferReq(handle)
    R1->>NIXL: postXferReq(handle)
    NIXL-->>R0: RDMA READ from Rank 1's src
    NIXL-->>R1: RDMA READ from Rank 0's src
    R0->>NIXL: getXferStatus() → NIXL_SUCCESS
    R1->>NIXL: getXferStatus() → NIXL_SUCCESS
    Note over R0,R1: deregisterTensors() — collective<br/>(deregisterMem + exchangeMetadata + barrier)
    R0->>NIXL: deregisterMem(src, dst)
    R1->>NIXL: deregisterMem(src, dst)
```
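The set/get/barrier/deleteKey protocol above can be sketched against a minimal in-memory key-value store. The `KVStore` type here is an illustrative single-process stand-in for `c10d::TCPStore`, and `exchangeMetadata` is a simplification of the flow in the diagram, not nvFuser's actual API.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal single-process stand-in for a distributed key-value store
// such as c10d::TCPStore.
struct KVStore {
  std::map<std::string, std::string> data;
  void set(const std::string& k, const std::string& v) { data[k] = v; }
  std::string get(const std::string& k) const {
    auto it = data.find(k);
    if (it == data.end()) throw std::runtime_error("missing key: " + k);
    return it->second;
  }
  void deleteKey(const std::string& k) { data.erase(k); }
};

// Each rank publishes its metadata under a rank-indexed key, then reads every
// peer's metadata. A rank may only delete its own key after a barrier has
// guaranteed that all peers have read it.
std::vector<std::string> exchangeMetadata(
    KVStore& store, int myRank, int worldSize, const std::string& localMD) {
  const std::string prefix = "nixl_agent_md_rank_";
  store.set(prefix + std::to_string(myRank), localMD);
  // (In a real run, the store's blocking get() waits until peers publish.)
  std::vector<std::string> remote;
  for (int r = 0; r < worldSize; ++r) {
    if (r != myRank) remote.push_back(store.get(prefix + std::to_string(r)));
  }
  // barrier() would go here; only then is it safe to delete our own key.
  store.deleteKey(prefix + std::to_string(myRank));
  return remote;
}
```

In this single-threaded sketch the peer's key must be present before the call; in the real multi-rank flow the store's blocking reads plus the barrier provide that ordering.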
Yes, making `registerTensors` a collective operation makes sense.

Why it makes sense:
Trade-offs to consider:
Looking at your implementation (lines 210-268):

```cpp
void registerTensors(...) {
  // ... register locally
  metadata_exchanged_ = false;
  exchangeMetadata(); // Collective: store->set(), store->get() all ranks, barrier()
}
```

The collective nature is enforced by the barrier at line 265, which is necessary to prevent ranks from deleting keys before others read them. This is a solid design for correctness.

Recommendation: Document this collective requirement clearly in the header comments.
mdavis36
left a comment
Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, would you be able to add screenshots/examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?
!test
@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
!test
!test |
No description provided.