Add NIXL backend #6016

Open
x41lakazam wants to merge 37 commits into main from dispatch_combine/nixl_backend
Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)
  • nixl.h: Define NixlBackend and NixlTransferHandle classes (+221/-0)
  • nixl.cpp: Implement NIXL backend with UCX for GPU transfers (+474/-0)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.h: Add nixl_available_ flag and backend check (+7/-1)
  • communicator.cpp: Initialize nixl_available_ and add NIXL case to output (+9/-1)

Configuration changes (2 files)
  • CMakeLists.txt: Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration (+39/-0)
  • utils.py: Add build_with_nixl config and cmake flag (+4/-0)

Documentation (1 file)
  • setup.py: Document NVFUSER_BUILD_WITH_NIXL build option (+3/-0)

Tests (1 file)
  • test_multidevice_nixl.cpp: Add tests for NIXL backend functionality (+289/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at lines 385-399 uses a busy-wait spin loop to poll the transfer status.
This can cause high CPU usage and is not ideal for latency-sensitive workloads.
Consider whether it should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider whether there is a more efficient approach to metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}

Collaborator

@samnordmann samnordmann left a comment


Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

Review comment on void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) in csrc/multidevice/nixl.cpp:

Collaborator

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make it non-blocking for the CPU, and ideally CUDA-graph capturable?

Collaborator Author

Definitely important, I suggest we leave it for another PR, to keep this one simple and not too big

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR adds a NIXL (NVIDIA Inference Xfer Library) backend to nvFuser's multidevice infrastructure, enabling high-performance one-sided RDMA transfers between GPU tensors across ranks using UCX. The implementation introduces a NixlBackend singleton wrapping a nixlAgent with the UCX transport, a NixlTransferHandle for amortized transfer setup, TCPStore-based metadata exchange for peer registration, and a full CMake/CI integration path.

Key changes:

  • csrc/multidevice/nixl.{h,cpp}: New backend with registerTensors / deregisterTensors (both collective, exchange NIXL agent metadata via TCPStore), prepareTransfer / postTransfer / waitTransfer lifecycle, and a compile-time stub when USE_NIXL is not defined.
  • csrc/multidevice/communicator.{h,cpp}: Adds kNixl to CommunicatorBackend enum and a compile-time nixl_available_ flag.
  • cmake/deps/handle_nixl.cmake: Locates NIXL headers/library from NIXL_PREFIX and validates CUDA major version compatibility.
  • tools/install-nixl.sh: CI helper that pip-installs the NIXL wheel for libnixl.so and clones the NIXL repo for C++ headers.
  • tests/cpp/test_multidevice_nixl.cpp: Unit and end-to-end ring-transfer tests.

Two issues were found:

  • ReadTransferEndToEnd calls deregisterTensors immediately after waitTransfer in a ring-read pattern without a barrier, creating a race where peer ranks may still hold in-flight RDMA reads against the to-be-deregistered source tensors.
  • tools/install-nixl.sh hardcodes nixl-cu12 unconditionally, silently installing mismatched NIXL libraries on CUDA 13 systems.

Confidence Score: 3/5

  • PR introduces a functional NIXL backend but has a race condition in the end-to-end read test and a hardcoded CUDA version in the install script that should be addressed before merging.
  • The core C++ backend logic is well-structured and most previously-flagged issues have been addressed or acknowledged. However, the race condition in ReadTransferEndToEnd (missing barrier before deregisterTensors) is a correctness bug that could cause flaky failures or undefined RDMA behavior in real multi-GPU environments. The hardcoded nixl-cu12 in the install script limits portability to CUDA 12 systems. These two issues lower confidence from an otherwise solid implementation.
  • tests/cpp/test_multidevice_nixl.cpp (race condition in ReadTransferEndToEnd) and tools/install-nixl.sh (hardcoded CUDA 12 variant) require attention before merging.

Important Files Changed

  • csrc/multidevice/nixl.cpp: Core NIXL backend implementation: NixlBackend singleton with UCX-backed RDMA transfers; previously-flagged issues around stub completeness, posted flag reset on error, deadlock in exchangeMetadata error paths, and the waitTransfer spin loop have all been addressed or acknowledged by the team.
  • csrc/multidevice/nixl.h: New public header: defines TensorDesc, NixlTransferHandle, NixlBackend, and TCPStore serialization helpers; TensorDesc::dev is correctly documented as the CUDA device index (not the communicator rank), addressing the earlier bug.
  • tests/cpp/test_multidevice_nixl.cpp: New NIXL test suite with unit and end-to-end tests; ReadTransferEndToEnd has a race condition: it calls deregisterTensors without a barrier after waitTransfer, while peer ranks may still be performing in-flight RDMA reads from the to-be-deregistered source tensor.
  • tools/install-nixl.sh: CI helper to install NIXL headers and libraries; hardcodes nixl-cu12 regardless of the system CUDA version, which will silently install mismatched libraries on CUDA 13 systems.
  • cmake/deps/handle_nixl.cmake: New CMake handler for NIXL: finds headers/libraries, sets NIXL_FOUND, and includes a Python-based CUDA major version compatibility check with clear status reporting.
  • csrc/multidevice/communicator.cpp: Adds nixl_available_ member set from the USE_NIXL compile-time flag; minor refactors (std::ranges, endl→'\n'); the gap between compile-time availability and runtime availability was flagged in a prior thread.
  • csrc/multidevice/communicator.h: Adds nixl_available_ field and kNixl branch in isBackendAvailable(); straightforward extension of the existing UCC/NCCL pattern.
  • csrc/multidevice/multidevice.h: Adds kNixl enumerator to CommunicatorBackend and adds an explicit underlying type (std::uint8_t); clean change.
  • .github/workflows/build.yml: Extends CI to install NIXL and build with NVFUSER_BUILD_WITH_NIXL=1; installs NIXL sequentially after pip-install-things.sh to avoid dependency conflicts.
  • CMakeLists.txt: Adds NVFUSER_BUILD_WITH_NIXL option, includes handle_nixl.cmake, links nixl unconditionally to NVFUSER_SRCS (compile guard via USE_NIXL definition), and adds the test file; clean integration following the existing UCC pattern.
  • python/tools/prereqs/requirements/nixl.py: New Python dependency reporter for NIXL with CUDA major version constraint display; clean implementation following the existing NVMMHRequirement pattern.
  • python/utils.py: Adds build_with_nixl flag to BuildConfig and wires it through the cmake invocation; straightforward extension matching the existing UCC pattern.

Sequence Diagram

sequenceDiagram
    participant R0 as Rank 0
    participant R1 as Rank 1
    participant Store as TCPStore
    participant NIXL as NIXL/UCX

    R0->>R0: NixlBackend::getInstance()<br/>(create nixlAgent + UCX backend)
    R1->>R1: NixlBackend::getInstance()<br/>(create nixlAgent + UCX backend)

    R0->>NIXL: registerMem(src, dst)
    R1->>NIXL: registerMem(src, dst)

    Note over R0,R1: exchangeMetadata() — collective
    R0->>Store: set("nixl_agent_md_rank_0", localMD)
    R1->>Store: set("nixl_agent_md_rank_1", localMD)
    R0->>Store: get("nixl_agent_md_rank_1")
    R1->>Store: get("nixl_agent_md_rank_0")
    R0->>NIXL: loadRemoteMD(rank1_md)
    R1->>NIXL: loadRemoteMD(rank0_md)
    R0-->>R1: barrier()
    R0->>Store: deleteKey("nixl_agent_md_rank_0")
    R1->>Store: deleteKey("nixl_agent_md_rank_1")

    R0->>NIXL: createXferReq(READ, local_dst, remote_src, "rank_1")
    R1->>NIXL: createXferReq(READ, local_dst, remote_src, "rank_0")

    R0->>NIXL: postXferReq(handle)
    R1->>NIXL: postXferReq(handle)

    NIXL-->>R0: RDMA READ from Rank 1's src
    NIXL-->>R1: RDMA READ from Rank 0's src

    R0->>NIXL: getXferStatus() → NIXL_SUCCESS
    R1->>NIXL: getXferStatus() → NIXL_SUCCESS

    Note over R0,R1: deregisterTensors() — collective<br/>(deregisterMem + exchangeMetadata + barrier)
    R0->>NIXL: deregisterMem(src, dst)
    R1->>NIXL: deregisterMem(src, dst)
Loading

Comments Outside Diff (2)

  1. tests/cpp/test_multidevice_nixl.cpp, line 176-178 (link)

    Race condition: deregisterTensors called without barrier after ring read

    In the ring-read pattern, every rank reads from its peer's src simultaneously. After rank 0's waitTransfer completes (rank 0's read of rank 1's src is done), rank N-1 may still be performing its NIXL RDMA read from rank 0's src. Immediately calling backend.deregisterTensors({dst, src}) will call agent_->deregisterMem(...) on rank 0's src while rank N-1's in-flight RDMA read is still targeting that buffer: an RDMA use-after-unregistration.

    Compare this to WriteTransferEndToEnd (line 1474) which correctly inserts communicator_->barrier() between waitTransfer and the assertion/deregister phase to ensure all peers have finished their writes before any rank deregisters.

    The fix is to add a barrier before deregisterTensors here as well, mirroring that pattern.

  2. tools/install-nixl.sh, line 23 (link)

    Hardcoded nixl-cu12 forces CUDA 12 regardless of system CUDA version

    pip install --no-deps nixl nixl-cu12 explicitly installs the CUDA 12 variant unconditionally. On a CUDA 13 system, this installs CUDA 12 binaries — the cmake handle_nixl.cmake CUDA version check will emit a WARNING about the mismatch, but the install itself silently uses the wrong libraries, which can cause linking or runtime errors.

    The script comment says that pip install nixl pulls in nixl-cu12 (default) or nixl-cu13 depending on extras, so pip install nixl alone should auto-select the right CUDA variant. The explicit nixl-cu12 argument overrides this automatic selection.

    Consider detecting the CUDA major version and installing the matching package:

    CUDA_MAJOR=$(nvcc --version 2>/dev/null | grep -oP 'release \K[0-9]+' | head -1)
    CUDA_MAJOR="${CUDA_MAJOR:-12}"  # default to 12 if nvcc not found
    pip install --no-deps nixl nixl-cu${CUDA_MAJOR}

Last reviewed commit: 9847a11

Contributor

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Yes, making registerTensors/deregisterTensors collective is reasonable for NIXL. Here's why:

Why it makes sense:

  1. NIXL uses one-sided RDMA - both the initiator and target need metadata about registered memory regions before transfers can happen
  2. All-to-all communication pattern - since any rank can potentially access any other rank's memory, metadata needs to be available globally
  3. Simplifies implementation - avoids complex lazy/on-demand metadata exchange logic
  4. Common pattern - most RDMA frameworks (NCCL, UCX-based systems) have similar collective registration

Trade-offs to consider:

  • ❌ Less flexible - can't have asymmetric registration patterns
  • ❌ Synchronization point - all ranks must participate even if they won't communicate
  • ✅ Simpler and more robust - no race conditions from partial metadata
  • ✅ Matches typical usage - your tests show all ranks register buffers before communication

Looking at your implementation (lines 210-268):

void registerTensors(...) {
  // ... register locally
  metadata_exchanged_ = false;
  exchangeMetadata();  // Collective: store->set(), store->get() all ranks, barrier()
}

The collective nature is enforced by the barrier at line 265, which is necessary to prevent ranks from deleting keys before others read them. This is a solid design for correctness.

Recommendation: Document this collective requirement clearly in the header comments for registerTensors/deregisterTensors so users know all ranks must call these together.

Collaborator

@mdavis36 mdavis36 left a comment

Thank you for addressing the CMake changes so quickly. Just a few more comments, but overall this looks good. When you get a chance, could you add screenshots/examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?

@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please give @x41lakazam permission to launch CI, or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test
