GPU-Initiated Networking (GIN):
* Provides a device-side API for integrating GPU-Initiated Networking
capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, and counter operations for data movement and signaling
(a rough sketch follows after the requirements list below).
* GIN API signatures and functionalities are subject to change.
* GIN Support Requirements:
  * CUDA 12.2 or later when compiling the GPU code.
  * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3.
  * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0.
  * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
    kernel >= 6.1 is required.
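As a rough illustration only: the GIN device API is explicitly subject to change, so every member and function name in the sketch below (the put, signal, and barrier calls, and the kernel parameters) is a hypothetical placeholder rather than the shipping API; consult the GIN samples in the examples directory for the real signatures.

    /* Hypothetical GIN kernel sketch. The GIN API is subject to change, and
     * the put/signal/barrier calls below are placeholders, not the shipping
     * signatures; see the GIN samples in the examples directory. */
    #include <nccl_device.h>   /* device-API header; exact name per version */

    __global__ void ginPutKernel(ncclGin gin, ncclWindow_t dstWin,
                                 size_t dstOffset, const float* src,
                                 size_t count, int peer) {
      if (threadIdx.x == 0 && blockIdx.x == 0) {
        /* Placeholder: write count floats into the peer's registered window. */
        gin.put(peer, dstWin, dstOffset, src, count * sizeof(float));
        /* Placeholder: raise a signal the peer can poll or wait on. */
        gin.signal(peer, /*signalIndex*/ 0);
      }
      /* Placeholder: synchronize participants before buffers are reused. */
      ncclGinBarrierSession bar(gin);
      bar.sync();
    }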
New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
communicator without freeing resources.
* This answers the need for a lightweight way to cancel in-flight
collectives and bring a communicator to a safe state before
split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and
supports blocking/non-blocking usage (a hedged sketch of the recovery
flow follows below).
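A minimal sketch of the intended recovery flow, assuming a single-argument ncclCommRevoke(comm) call; the real signature (for example, a flag selecting the optional cross-rank barrier or non-blocking behavior) may differ:

    #include <stdio.h>
    #include <nccl.h>

    #define NCCLCHECK(cmd) do {                                       \
      ncclResult_t r_ = (cmd);                                        \
      if (r_ != ncclSuccess) {                                        \
        fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r_));  \
        return r_;                                                    \
      }                                                               \
    } while (0)

    /* Hypothetical recovery flow; the exact ncclCommRevoke argument list
     * (e.g. a flag selecting the optional cross-rank barrier) may differ. */
    ncclResult_t quiesceAndDestroy(ncclComm_t comm) {
      /* Cancel in-flight collectives and bring the communicator to a safe
       * state; the revoke itself does not free any resources. */
      NCCLCHECK(ncclCommRevoke(comm));
      /* With the communicator quiesced, finalize and destroy it safely
       * (a split or shrink could be issued here instead). */
      NCCLCHECK(ncclCommFinalize(comm));
      NCCLCHECK(ncclCommDestroy(comm));
      return ncclSuccess;
    }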
New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for
example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
environment plugin (see the example below).
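For illustration, assuming the variable names the plugin library to load (by analogy with NCCL's other *_PLUGIN variables), and using a made-up library name:

    #include <stdlib.h>
    #include <nccl.h>

    int main(void) {
      /* Select an external environment plugin before the first NCCL call.
       * "libnccl-env-example.so" is a made-up placeholder name; the value
       * semantics (library name vs. path) are our assumption. */
      setenv("NCCL_ENV_PLUGIN", "libnccl-env-example.so", 1);

      /* ... normal NCCL initialization (ncclGetUniqueId / ncclCommInitRank)
       * proceeds as usual; the plugin can inject NCCL_* variables, e.g.
       * loaded from a centralized database, before NCCL reads them. */
      return 0;
    }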
New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization,
point-to-point communication, and collective operations, as well as
advanced features such as user buffer registration, symmetric memory,
and the device API (a minimal AllReduce sketch in that spirit follows
below).
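In that spirit, here is a minimal single-process, multi-GPU AllReduce sketch of our own (not a verbatim copy of any sample in the examples directory); error handling is trimmed for brevity:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main(void) {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev < 1) return 1;
      if (nDev > 8) nDev = 8;

      ncclComm_t comms[8];
      cudaStream_t streams[8];
      float* buf[8];
      int devs[8];
      size_t count = 1 << 20;

      for (int i = 0; i < nDev; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
      }

      /* One communicator per local GPU, all owned by this process. */
      ncclCommInitAll(comms, nDev, devs);

      /* In-place sum AllReduce across all local GPUs, grouped so the calls
       * are issued as a single NCCL operation. */
      ncclGroupStart();
      for (int i = 0; i < nDev; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
      }
      printf("AllReduce on %d GPU(s) complete\n", nDev);
      return 0;
    }

Multi-process variants typically replace ncclCommInitAll with ncclGetUniqueId plus ncclCommInitRank, one rank per process.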
Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization
functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the
availability of symmetric memory (a hedged registration sketch follows
below).
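As a hedged host-side sketch of making a buffer available for symmetric-memory use: ncclCommWindowRegister is referenced elsewhere in these notes, but the NCCL_WIN_COLL_SYMMETRIC flag name is our assumption from earlier releases and may differ; ncclFindWindow is not shown because its signature is new here.

    #include <cuda_runtime.h>
    #include <nccl.h>

    /* Hedged sketch: register a device buffer as a symmetric window so that
     * symmetric kernels (and window lookups such as ncclFindWindow) can use
     * it. NCCL_WIN_COLL_SYMMETRIC is our assumed flag name; check nccl.h on
     * your NCCL version for the exact spelling. */
    ncclResult_t registerSymmetric(ncclComm_t comm, size_t bytes,
                                   void** buff, ncclWindow_t* win) {
      /* Every rank must register a buffer of the same size for symmetric use. */
      if (cudaMalloc(buff, bytes) != cudaSuccess) return ncclSystemError;
      /* Per these notes, a NULL window is returned if the system does not
       * support window registration. */
      return ncclCommWindowRegister(comm, *buff, bytes, win,
                                    NCCL_WIN_COLL_SYMMETRIC);
    }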
Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable
metrics collection.
* Enables structured data export for monitoring tools, dashboards, and
automated analysis systems.
Github Pull Requests resolved:
* Fast Init - CPU optimizations for NCCL initialization at large scale.
(PR NVIDIA#1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by
sending bootstrap information bidirectionally. (PR NVIDIA#1791)
* Fixes spurious failures when PyTorch is statically linked with
NCCL 2.28.3, where an error was not drained but instead propagated
into the next CUDA kernel invocation. (PR NVIDIA#1864)
Other notable improvements:
* Fixes multicast object leaks in the case of failed NVLS user buffer
registrations, which could lead to crashes. Avoids such registration
attempts when incompatible memory allocators are used.
* Fixes potential data corruption with built-in symmetric kernels for
small messages with size granularity under 8 bytes or when multiple
symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of an
uneven GPU count per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
split mask settings, where NCCL cannot find a viable ring.
* Fixes crash when NCCL is compiled with recent CUDA versions but
running on hosts with certain specific older CUDA drivers.
Added a pure GIN all-to-all example and a hybrid GIN/LSA one.
Fixes operation ordering between the main thread and the proxy thread to prevent hangs at large scale; a bug fix in GIN (Issue NVIDIA#1893).
Device API Improvements:
* Supports Device API struct versioning for backwards compatibility with future versions.
* Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
* Adds functions to obtain host-accessible device pointers from symmetric registered ncclWindows.
* Adds improved GIN documentation to clarify the support matrix.
New One-Sided Host APIs:
* Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc.) for both network and NVL using zero SMs (a rough, hypothetical usage sketch follows after the Community Engagement list below).
* A one-sided communication operation writes data from the local buffer to a remote peer's registered memory window without explicit participation from the target process.
* Utilizes the CopyEngine for NVL transfers and a CPU proxy for the network.
* Requires CUDA 12.5 or greater.
New Experimental Python language binding (NCCL4Py):
* Pythonic NCCL API for Python applications: native collectives, P2P, and other NCCL operations.
* Interoperable with the CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
* Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).
New LLVM intermediate representation (IR) support:
* Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
* Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSLs).
* Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
* Requires CUDA 12 and Clang 21.
Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
* Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
* Requires symmetric memory registration and GIN.
New ncclCommGrow API:
* Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
* Use ncclCommGrow with ncclCommShrink to adjust the membership of communicators in response to failing and recovering nodes.
* Also addresses the need for elastic applications to expand a running job by integrating new ranks.
Multi-segment registration:
* Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib, and nvls transports.
* Enables support for expandable segments in PyTorch.
Improved scalability of the AllGatherV pattern:
* Adds support for a scalable allgatherv pattern (a group of broadcasts).
* Adds a new scheduler path and new kernels to improve performance at large scale.
Debuggability & Observability Improvements:
* RAS supports real-time monitoring to continuously track peer status changes.
* Inspector adds support for Prometheus-format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
* Adds profiler support for CopyEngine (CE) based collectives.
Community Engagement:
* Adds a contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
* Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC, which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (GitHub PR NVIDIA#1759)
* Fixes a segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (GitHub PR NVIDIA#1881)
* Fixes a crash that can happen when calling p2p and then collectives while using the same user buffer. (GitHub Issue NVIDIA#1859)
* Fixes a bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (GitHub Issue NVIDIA#1876)
* Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)
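The one-sided host APIs above are new in this release and their exact signatures are not reproduced here; as a rough shape only (every parameter below is a hypothetical placeholder), a put-with-signal exchange might look like:

    /* Hypothetical shape of a one-sided exchange. The parameter lists below
     * are placeholders, not the shipping ncclPutSignal/ncclWaitSignal
     * signatures; requires CUDA 12.5 or greater per the notes above. */
    #include <cuda_runtime.h>
    #include <nccl.h>

    void onesidedExchange(ncclComm_t comm, ncclWindow_t remoteWin,
                          const void* localBuf, size_t bytes, int peer,
                          cudaStream_t stream) {
      /* Write localBuf into the peer's registered window and raise a signal;
       * the target process does not post a matching receive (the CopyEngine
       * is used over NVL, a CPU proxy over the network). */
      ncclPutSignal(localBuf, bytes, peer, remoteWin, /*offset*/ 0,
                    /*signal*/ 0, comm, stream);
      /* Block the stream until the peer's corresponding signal arrives. */
      ncclWaitSignal(/*signal*/ 0, comm, stream);
    }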
Other Improvements:
* Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
* Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
* Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
* Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
* Prints the git branch and commit checksum at the INFO level during NCCL initialization.
* Improves support for symmetric window registrations on CUDA versions prior to 12.1.
* Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
* All-to-all, send, and recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of the available network devices as dictated by the Network Device Policy.
* Fixes a hang on GB200/300 + CX8 when the user disables GDR.
* Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield "no algorithm/protocol available".
* ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
* Prints a more prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
* Upgrades to DOCA GPUNetIO v1.1.
Known Limitations:
* Since the Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
* One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA graph support.
* The improved AllGatherV support breaks NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until this is fixed in a future release.
* NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
  cuda.core 0.5.0 removed "experimental" from the module path and added experimental/__init__.py for compatibility, but cuda.core.experimental._stream.IsStreamT and cuda.core.experimental._memory.DevicePointerT are not included, leading to a compatibility issue.
Fixes compare-and-swap (CAS) usage in the case of a weak CAS failure, which was causing a hang on ARM. The issue affects NCCL when compiled with GCC versions prior to 10.