
Reduce the overhead of nccl symmetric memory #12524

Merged

merrymercy merged 11 commits into main from lianmin/symm-memory on Nov 3, 2025

Conversation


@merrymercy merrymercy commented Nov 2, 2025

Co-authored-by: Nicolas Castet ncastet@nvidia.com

Some issues:

  1. Symmetric memory is slower than custom allreduce at small batch sizes (1) but faster at large batch sizes (128).
  2. Symmetric memory is not compatible with DeepGEMM. When I disable DeepGEMM with SGLANG_ENABLE_JIT_DEEPGEMM=0 (as shown below), it runs; otherwise it hangs during initialization.
  3. When I use symmetric memory, the torch profiler shows a new CUDA stream being created for every layer (possibly a PyTorch bug).
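
For reference, the workaround in item 2 can be applied before SGLang starts; a minimal sketch, assuming the variable is read at process startup:

```python
import os

# Workaround for issue 2 above: disable JIT DeepGEMM so that
# initialization does not hang when symmetric memory is enabled.
os.environ["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
```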

@gemini-code-assist
Contributor

Summary of Changes

Hello @merrymercy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reduces the overhead of NCCL symmetric memory in SGLang. It does so by moving memory registration into the C-level allocation path, refining CUDA stream management, and simplifying the Python-side symmetric-memory context manager. Together these changes streamline distributed memory operations, particularly for models like DeepseekV2.

Highlights

  • NCCL Symmetric Memory Optimization: The core change moves memory registration into the C allocation function (nccl_alloc_plug), which now calls ncclCommWindowRegister during allocation instead of registering buffers from Python after the fact.
  • Stream Management Refinement: A new utility, get_current_device_stream_fast(), caches the device index to speed up retrieval of the current CUDA stream, replacing torch.cuda.current_stream() and torch.get_device_module().current_stream() calls across the codebase (sketched after this list).
  • Simplified use_symmetric_memory Context Manager: The context manager now passes the NCCL communicator to the C allocation function via the SGLANG_TMP_NCCL_COMM_VALUE environment variable, eliminating explicit register_comm_window_raw calls in Python and removing PyTorch version-specific checks from the graph-capture logic (see the sketch below).
  • Environment Variable Handling: _set_envs_and_config now sets NCCL_CUMEM_ENABLE and NCCL_NVLS_ENABLE correctly and robustly, especially when symmetric memory is enabled, so NCCL behaves as expected (also sketched below).
  • DeepseekV2 Model Integration: The DeepseekV2 forward pass no longer uses explicit use_symmetric_memory contexts or sm.tag calls; tensor additions are simplified because symmetric memory is now handled at a lower level by the allocator changes above.
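
A minimal sketch of what get_current_device_stream_fast() could look like; the device-index caching matches the description above, but the exact implementation here is an assumption:

```python
import torch

_cached_device_index = None  # assumption: the process stays pinned to one GPU


def get_current_device_stream_fast() -> torch.cuda.Stream:
    """Return the current CUDA stream, caching the device index.

    Passing an explicit device to torch.cuda.current_stream() skips the
    per-call device lookup that the no-argument form performs.
    """
    global _cached_device_index
    if _cached_device_index is None:
        _cached_device_index = torch.cuda.current_device()
    return torch.cuda.current_stream(_cached_device_index)
```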
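
The refactored context manager is described as handing the communicator to the C allocator through SGLANG_TMP_NCCL_COMM_VALUE; a hypothetical sketch of that hand-off (the pointer encoding and cleanup are assumptions):

```python
import os
from contextlib import contextmanager


@contextmanager
def use_symmetric_memory(comm_ptr: int):
    # Publish the NCCL communicator handle so the C allocator
    # (nccl_alloc_plug) can call ncclCommWindowRegister on each new
    # buffer at allocation time, instead of Python registering it later.
    os.environ["SGLANG_TMP_NCCL_COMM_VALUE"] = str(comm_ptr)
    try:
        yield
    finally:
        # Clear the handle so allocations outside the context are not registered.
        os.environ["SGLANG_TMP_NCCL_COMM_VALUE"] = "0"
```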
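
And a hedged sketch of the corresponding env-var handling in _set_envs_and_config; that these particular values are required is an assumption based on NCCL window registration needing cuMem-backed allocations:

```python
import os


def _set_envs_and_config(enable_symmetric_memory: bool) -> None:
    # Sketch only: the PR ensures these NCCL knobs are configured when
    # symmetric memory is on; exact values/conditions are assumptions.
    if enable_symmetric_memory:
        os.environ["NCCL_CUMEM_ENABLE"] = "1"  # window registration uses cuMem allocations
        os.environ["NCCL_NVLS_ENABLE"] = "1"   # NVLS path benefits symmetric-memory collectives
```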

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to reduce the overhead of NCCL symmetric memory by moving memory registration into a custom C++ allocator and using a faster method to get the current CUDA stream. The changes are generally positive for performance. However, I've identified some critical issues related to error handling in the new C++ code and potential bugs in the new fast stream-retrieval utility. These should be addressed to ensure the stability and correctness of the implementation.

@merrymercy merrymercy merged commit 7a21d8b into main Nov 3, 2025
106 of 115 checks passed
@merrymercy merrymercy deleted the lianmin/symm-memory branch November 3, 2025 19:56