
Reduce the overhead of nccl symmetric memory #12524

Merged

merrymercy merged 11 commits into main from lianmin/symm-memory on Nov 3, 2025

Conversation


@merrymercy merrymercy commented Nov 2, 2025

Co-authored-by: Nicolas Castet ncastet@nvidia.com

Some issues:

  1. Symmetric memory is slower than custom allreduce at small batch sizes (1) but faster at large batch sizes (128).
  2. Symmetric memory is not compatible with DeepGEMM. When I disable DeepGEMM with SGLANG_ENABLE_JIT_DEEPGEMM=0 (as shown below), it runs; otherwise it hangs during initialization.
  3. When I use symmetric memory, the torch profiler shows a new CUDA stream being created for every layer (possibly a PyTorch bug).
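
For reference, the workaround in item 2 can be applied before SGLang starts; a minimal sketch, assuming the variable is read at process startup:

```python
import os

# Workaround for issue 2 above: disable JIT DeepGEMM so that
# initialization does not hang when symmetric memory is enabled.
os.environ["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
```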

@gemini-code-assist
Contributor

Summary of Changes

Hello @merrymercy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reduces the overhead of NCCL symmetric memory in SGLang. It does so by moving memory registration into the C-level allocation path, refining CUDA stream management, and simplifying the Python-side symmetric-memory context manager. Together these changes streamline distributed memory operations, particularly for models like DeepseekV2.

Highlights

  • NCCL Symmetric Memory Optimization: The core change moves memory registration into the C allocation function (nccl_alloc_plug), which now calls ncclCommWindowRegister during allocation instead of registering buffers from Python after the fact.
  • Stream Management Refinement: A new utility, get_current_device_stream_fast(), caches the device index to speed up retrieval of the current CUDA stream, replacing torch.cuda.current_stream() and torch.get_device_module().current_stream() calls across the codebase (sketched after this list).
  • Simplified use_symmetric_memory Context Manager: The context manager now passes the NCCL communicator to the C allocation function via the SGLANG_TMP_NCCL_COMM_VALUE environment variable, eliminating explicit register_comm_window_raw calls in Python and removing PyTorch version-specific checks from the graph-capture logic (see the sketch below).
  • Environment Variable Handling: _set_envs_and_config now sets NCCL_CUMEM_ENABLE and NCCL_NVLS_ENABLE correctly and robustly, especially when symmetric memory is enabled, so NCCL behaves as expected (also sketched below).
  • DeepseekV2 Model Integration: The DeepseekV2 forward pass no longer uses explicit use_symmetric_memory contexts or sm.tag calls; tensor additions are simplified because symmetric memory is now handled at a lower level by the allocator changes above.
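
A minimal sketch of what get_current_device_stream_fast() could look like; the device-index caching matches the description above, but the exact implementation here is an assumption:

```python
import torch

_cached_device_index = None  # assumption: the process stays pinned to one GPU


def get_current_device_stream_fast() -> torch.cuda.Stream:
    """Return the current CUDA stream, caching the device index.

    Passing an explicit device to torch.cuda.current_stream() skips the
    per-call device lookup that the no-argument form performs.
    """
    global _cached_device_index
    if _cached_device_index is None:
        _cached_device_index = torch.cuda.current_device()
    return torch.cuda.current_stream(_cached_device_index)
```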
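
The refactored context manager is described as handing the communicator to the C allocator through SGLANG_TMP_NCCL_COMM_VALUE; a hypothetical sketch of that hand-off (the pointer encoding and cleanup are assumptions):

```python
import os
from contextlib import contextmanager


@contextmanager
def use_symmetric_memory(comm_ptr: int):
    # Publish the NCCL communicator handle so the C allocator
    # (nccl_alloc_plug) can call ncclCommWindowRegister on each new
    # buffer at allocation time, instead of Python registering it later.
    os.environ["SGLANG_TMP_NCCL_COMM_VALUE"] = str(comm_ptr)
    try:
        yield
    finally:
        # Clear the handle so allocations outside the context are not registered.
        os.environ["SGLANG_TMP_NCCL_COMM_VALUE"] = "0"
```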
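
And a hedged sketch of the corresponding env-var handling in _set_envs_and_config; that these particular values are required is an assumption based on NCCL window registration needing cuMem-backed allocations:

```python
import os


def _set_envs_and_config(enable_symmetric_memory: bool) -> None:
    # Sketch only: the PR ensures these NCCL knobs are configured when
    # symmetric memory is on; exact values/conditions are assumptions.
    if enable_symmetric_memory:
        os.environ["NCCL_CUMEM_ENABLE"] = "1"  # window registration uses cuMem allocations
        os.environ["NCCL_NVLS_ENABLE"] = "1"   # NVLS path benefits symmetric-memory collectives
```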

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to reduce the overhead of NCCL symmetric memory by moving memory registration into a custom C++ allocator and using a faster method to get the current CUDA stream. The changes are generally positive for performance. However, I've identified some critical issues related to error handling in the new C++ code and potential bugs in the new fast stream-retrieval utility. These should be addressed to ensure the stability and correctness of the implementation.

@merrymercy merrymercy merged commit 7a21d8b into main Nov 3, 2025
106 of 115 checks passed
@merrymercy merrymercy deleted the lianmin/symm-memory branch November 3, 2025 19:56