GPU-Initiated Networking (GIN):
* Provides a device-side API for integrating GPU-Initiated Networking
capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, and counter operations for data movement and signaling
(a rough sketch follows after the requirements list below).
* GIN API signatures and functionalities are subject to change.
* GIN Support Requirements:
  * CUDA 12.2 or later when compiling the GPU code.
  * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3.
  * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0.
  * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
    kernel >= 6.1 is required.
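As a rough illustration only: the GIN device API is explicitly subject to change, so every member and function name in the sketch below (the put, signal, and barrier calls, and the kernel parameters) is a hypothetical placeholder rather than the shipping API; consult the GIN samples in the examples directory for the real signatures.

    /* Hypothetical GIN kernel sketch. The GIN API is subject to change, and
     * the put/signal/barrier calls below are placeholders, not the shipping
     * signatures; see the GIN samples in the examples directory. */
    #include <nccl_device.h>   /* device-API header; exact name per version */

    __global__ void ginPutKernel(ncclGin gin, ncclWindow_t dstWin,
                                 size_t dstOffset, const float* src,
                                 size_t count, int peer) {
      if (threadIdx.x == 0 && blockIdx.x == 0) {
        /* Placeholder: write count floats into the peer's registered window. */
        gin.put(peer, dstWin, dstOffset, src, count * sizeof(float));
        /* Placeholder: raise a signal the peer can poll or wait on. */
        gin.signal(peer, /*signalIndex*/ 0);
      }
      /* Placeholder: synchronize participants before buffers are reused. */
      ncclGinBarrierSession bar(gin);
      bar.sync();
    }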
New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
communicator without freeing resources.
* This answers the need for a lightweight way to cancel in-flight
collectives and bring a communicator to a safe state before
split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and
supports blocking/non-blocking usage (a hedged sketch of the recovery
flow follows below).
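A minimal sketch of the intended recovery flow, assuming a single-argument ncclCommRevoke(comm) call; the real signature (for example, a flag selecting the optional cross-rank barrier or non-blocking behavior) may differ:

    #include <stdio.h>
    #include <nccl.h>

    #define NCCLCHECK(cmd) do {                                       \
      ncclResult_t r_ = (cmd);                                        \
      if (r_ != ncclSuccess) {                                        \
        fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r_));  \
        return r_;                                                    \
      }                                                               \
    } while (0)

    /* Hypothetical recovery flow; the exact ncclCommRevoke argument list
     * (e.g. a flag selecting the optional cross-rank barrier) may differ. */
    ncclResult_t quiesceAndDestroy(ncclComm_t comm) {
      /* Cancel in-flight collectives and bring the communicator to a safe
       * state; the revoke itself does not free any resources. */
      NCCLCHECK(ncclCommRevoke(comm));
      /* With the communicator quiesced, finalize and destroy it safely
       * (a split or shrink could be issued here instead). */
      NCCLCHECK(ncclCommFinalize(comm));
      NCCLCHECK(ncclCommDestroy(comm));
      return ncclSuccess;
    }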
New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for
example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
environment plugin (see the example below).
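For illustration, assuming the variable names the plugin library to load (by analogy with NCCL's other *_PLUGIN variables), and using a made-up library name:

    #include <stdlib.h>
    #include <nccl.h>

    int main(void) {
      /* Select an external environment plugin before the first NCCL call.
       * "libnccl-env-example.so" is a made-up placeholder name; the value
       * semantics (library name vs. path) are our assumption. */
      setenv("NCCL_ENV_PLUGIN", "libnccl-env-example.so", 1);

      /* ... normal NCCL initialization (ncclGetUniqueId / ncclCommInitRank)
       * proceeds as usual; the plugin can inject NCCL_* variables, e.g.
       * loaded from a centralized database, before NCCL reads them. */
      return 0;
    }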
New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization,
point-to-point communication, and collective operations, as well as
advanced features such as user buffer registration, symmetric memory,
and the device API (a minimal AllReduce sketch in that spirit follows
below).
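In that spirit, here is a minimal single-process, multi-GPU AllReduce sketch of our own (not a verbatim copy of any sample in the examples directory); error handling is trimmed for brevity:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main(void) {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev < 1) return 1;
      if (nDev > 8) nDev = 8;

      ncclComm_t comms[8];
      cudaStream_t streams[8];
      float* buf[8];
      int devs[8];
      size_t count = 1 << 20;

      for (int i = 0; i < nDev; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
      }

      /* One communicator per local GPU, all owned by this process. */
      ncclCommInitAll(comms, nDev, devs);

      /* In-place sum AllReduce across all local GPUs, grouped so the calls
       * are issued as a single NCCL operation. */
      ncclGroupStart();
      for (int i = 0; i < nDev; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
      }
      printf("AllReduce on %d GPU(s) complete\n", nDev);
      return 0;
    }

Multi-process variants typically replace ncclCommInitAll with ncclGetUniqueId plus ncclCommInitRank, one rank per process.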
Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization
functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the
availability of symmetric memory (a hedged registration sketch follows
below).
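As a hedged host-side sketch of making a buffer available for symmetric-memory use: ncclCommWindowRegister is referenced elsewhere in these notes, but the NCCL_WIN_COLL_SYMMETRIC flag name is our assumption from earlier releases and may differ; ncclFindWindow is not shown because its signature is new here.

    #include <cuda_runtime.h>
    #include <nccl.h>

    /* Hedged sketch: register a device buffer as a symmetric window so that
     * symmetric kernels (and window lookups such as ncclFindWindow) can use
     * it. NCCL_WIN_COLL_SYMMETRIC is our assumed flag name; check nccl.h on
     * your NCCL version for the exact spelling. */
    ncclResult_t registerSymmetric(ncclComm_t comm, size_t bytes,
                                   void** buff, ncclWindow_t* win) {
      /* Every rank must register a buffer of the same size for symmetric use. */
      if (cudaMalloc(buff, bytes) != cudaSuccess) return ncclSystemError;
      /* Per these notes, a NULL window is returned if the system does not
       * support window registration. */
      return ncclCommWindowRegister(comm, *buff, bytes, win,
                                    NCCL_WIN_COLL_SYMMETRIC);
    }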
Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable
metrics collection.
* Enables structured data export for monitoring tools, dashboards, and
automated analysis systems.
Github Pull Requests resolved:
* Fast Init - CPU optimizations for NCCL initialization at large scale.
(PR NVIDIA#1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by
sending bootstrap information bidirectionally. (PR NVIDIA#1791)
* Fixes spurious failures when PyTorch is statically linked with
NCCL 2.28.3, where an error was not drained but instead propagated
into the next CUDA kernel invocation. (PR NVIDIA#1864)
Other notable improvements:
* Fixes multicast object leaks in the case of failed NVLS user buffer
registrations, which could lead to crashes. Avoids such registration
attempts when incompatible memory allocators are used.
* Fixes potential data corruption with built-in symmetric kernels for
small messages with size granularity under 8 bytes or when multiple
symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of an
uneven GPU count per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
split mask settings, where NCCL cannot find a viable ring.
* Fixes crash when NCCL is compiled with recent CUDA versions but
running on hosts with certain specific older CUDA drivers.
Added a pure GIN all-to-all example and a hybrid GIN/LSA one.
Fixes operation ordering between the main thread and the proxy thread to prevent hangs at large scale; a bug fix in GIN (Issue NVIDIA#1893).
Device API Improvements:
* Supports Device API struct versioning for backwards compatibility with future versions.
* Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
* Adds functions to obtain host-accessible device pointers from symmetric registered ncclWindows.
* Adds improved GIN documentation to clarify the support matrix.
New One-Sided Host APIs:
* Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc.) for both network and NVL using zero SMs (a rough, hypothetical usage sketch follows after the Community Engagement list below).
* A one-sided communication operation writes data from the local buffer to a remote peer's registered memory window without explicit participation from the target process.
* Utilizes the CopyEngine for NVL transfers and a CPU proxy for the network.
* Requires CUDA 12.5 or greater.
New Experimental Python language binding (NCCL4Py):
* Pythonic NCCL API for Python applications: native collectives, P2P, and other NCCL operations.
* Interoperable with the CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
* Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).
New LLVM intermediate representation (IR) support:
* Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
* Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSLs).
* Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
* Requires CUDA 12 and Clang 21.
Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
* Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
* Requires symmetric memory registration and GIN.
New ncclCommGrow API:
* Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
* Use ncclCommGrow with ncclCommShrink to adjust the membership of communicators in response to failing and recovering nodes.
* Also addresses the need for elastic applications to expand a running job by integrating new ranks.
Multi-segment registration:
* Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib, and nvls transports.
* Enables support for expandable segments in PyTorch.
Improved scalability of the AllGatherV pattern:
* Adds support for a scalable allgatherv pattern (a group of broadcasts).
* Adds a new scheduler path and new kernels to improve performance at large scale.
Debuggability & Observability Improvements:
* RAS supports real-time monitoring to continuously track peer status changes.
* Inspector adds support for Prometheus-format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
* Adds profiler support for CopyEngine (CE) based collectives.
Community Engagement:
* Adds a contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
* Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC, which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (GitHub PR NVIDIA#1759)
* Fixes a segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (GitHub PR NVIDIA#1881)
* Fixes a crash that can happen when calling p2p and then collectives while using the same user buffer. (GitHub Issue NVIDIA#1859)
* Fixes a bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (GitHub Issue NVIDIA#1876)
* Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)
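The one-sided host APIs above are new in this release and their exact signatures are not reproduced here; as a rough shape only (every parameter below is a hypothetical placeholder), a put-with-signal exchange might look like:

    /* Hypothetical shape of a one-sided exchange. The parameter lists below
     * are placeholders, not the shipping ncclPutSignal/ncclWaitSignal
     * signatures; requires CUDA 12.5 or greater per the notes above. */
    #include <cuda_runtime.h>
    #include <nccl.h>

    void onesidedExchange(ncclComm_t comm, ncclWindow_t remoteWin,
                          const void* localBuf, size_t bytes, int peer,
                          cudaStream_t stream) {
      /* Write localBuf into the peer's registered window and raise a signal;
       * the target process does not post a matching receive (the CopyEngine
       * is used over NVL, a CPU proxy over the network). */
      ncclPutSignal(localBuf, bytes, peer, remoteWin, /*offset*/ 0,
                    /*signal*/ 0, comm, stream);
      /* Block the stream until the peer's corresponding signal arrives. */
      ncclWaitSignal(/*signal*/ 0, comm, stream);
    }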
Other Improvements:
* Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
* Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
* Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
* Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
* Prints the git branch and commit checksum at the INFO level during NCCL initialization.
* Improves support for symmetric window registrations on CUDA versions prior to 12.1.
* Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
* All-to-all, send, and recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of the available network devices as dictated by the Network Device Policy.
* Fixes a hang on GB200/300 + CX8 when the user disables GDR.
* Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield "no algorithm/protocol available".
* ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
* Prints a more prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
* Upgrades to DOCA GPUNetIO v1.1.
Known Limitations:
* Since the Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
* One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA graph support.
* The improved AllGatherV support breaks NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until this is fixed in a future release.
* NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
  cuda.core 0.5.0 removed "experimental" from the module path and added experimental/__init__.py for compatibility, but cuda.core.experimental._stream.IsStreamT and cuda.core.experimental._memory.DevicePointerT are not included, leading to a compatibility issue.
Fixes compare-and-swap (CAS) usage in the case of a weak CAS failure, which was causing a hang on ARM. The issue affects NCCL when compiled with GCC versions prior to 10.