
Conversation

@hubertlu-tw (Collaborator)

Co-authors: @b8zhong, @kkHuang-amd, Alan Kao.

Motivation

Continues the work from #11484.

Modifications

Accuracy Tests

$ SGLANG_USE_AITER=1 NCCL_MIN_NCHANNELS=112 SGLANG_INT4_WEIGHT=0 SGLANG_MOE_PADDING=1 \
  SGLANG_USE_ROCM700A=1 SGLANG_SET_CPU_AFFINITY=1 SGLANG_ROCM_FUSED_DECODE_MLA=1 SGLANG_AITER_AR=1 \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-MXFP4-Preview/ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.95 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

$ python3 benchmark/gsm8k/bench_sglang.py --parallel 1400 --num-questions 1400

Accuracy: 0.942
Invalid: 0.000

Benchmarking and Profiling

TP=8 results from torchrun --nproc_per_node=8 benchmark/kernels/all_reduce/benchmark_aiter.py (a minimal sketch of such a timing loop follows the table):

    Size    SGLang(ms)    Aiter(ms)
-----------------------------------
     32K         0.038        0.045
     64K         0.046        0.058
    128K         0.042        0.042
    256K         0.053        0.053
    512K         0.046        0.044
      1M         0.058        0.053
      2M         0.056        0.050
      4M         0.082        0.069
      8M         0.099        0.081
     16M         0.167        0.130
     32M         0.270        0.201
     64M         0.508        0.365
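
For reference, a timing loop of the kind benchmark_aiter.py presumably runs could look like the sketch below. This is a hypothetical illustration using plain torch.distributed, not the actual benchmark script; the function name and size sweep are illustrative only.

# Hypothetical sketch of an all-reduce latency measurement; NOT the actual
# benchmark_aiter.py from this PR, just the general pattern such a script follows.
import torch
import torch.distributed as dist

def time_all_reduce_ms(numel: int, iters: int = 100) -> float:
    x = torch.randn(numel, dtype=torch.float16, device="cuda")
    for _ in range(10):  # warm-up to exclude allocation/launch overheads
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call

if __name__ == "__main__":
    # Launched via: torchrun --nproc_per_node=8 this_script.py
    dist.init_process_group("nccl")  # RCCL backend on ROCm builds of PyTorch
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    for size_kib in (32, 64, 128, 256, 512):
        numel = size_kib * 1024 // 2  # fp16 elements in a size_kib-KiB message
        ms = time_all_reduce_ms(numel)
        if dist.get_rank() == 0:
            print(f"{size_kib}K: {ms:.3f} ms")
    dist.destroy_process_group()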

@gemini-code-assist (Contributor)

Summary of Changes

Hello @hubertlu-tw, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances SGLang's distributed capabilities by integrating AITER's custom all-reduce implementation for AMD GPUs. The primary goal is to leverage AITER's optimized kernels to improve the performance of all-reduce operations on ROCm-enabled systems. It provides a configurable way to enable or disable this new functionality and includes a dedicated benchmark to assess its impact.

Highlights

  • AITER Custom All-reduce Integration: Introduces support for AITER's custom all-reduce kernels, specifically for ROCm (AMD GPUs), allowing for potentially optimized distributed operations.
  • New Environment Variable: Adds SGLANG_AITER_AR (defaulting to true) to control whether the AITER custom all-reduce kernels are used. Setting it to 0 will disable them.
  • Dynamic All-reduce Dispatch: Implements a dispatch mechanism to conditionally select between SGLang's native custom all-reduce and AITER's implementation based on the SGLANG_AITER_AR environment variable and the hardware platform (ROCm); see the sketch after this list.
  • Benchmarking Script: A new benchmark script (benchmark_aiter.py) has been added to compare the performance of SGLang's custom all-reduce against AITER's across various message sizes.
  • Test Suite Enhancements: The custom all-reduce test suite has been updated to include larger message sizes and to address potential "Global server args is not set yet!" errors by setting dummy server arguments.
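
A minimal sketch of that dispatch, assuming hypothetical helper and import names; only SGLANG_AITER_AR, its default, and the ROCm check come from this PR's description:

# Hypothetical sketch of the dispatch described above. The helper names and
# the aiter import path are assumptions, not verified against this PR's diff.
import os
import torch

def _use_aiter_all_reduce() -> bool:
    # SGLANG_AITER_AR defaults to true; setting it to 0 disables AITER's kernels.
    return torch.version.hip is not None and os.environ.get("SGLANG_AITER_AR", "1") != "0"

def create_all_reduce_communicator(group, device):
    if _use_aiter_all_reduce():
        from aiter import CustomAllreduce  # import path is a guess
    else:
        from sglang.srt.distributed.device_communicators.custom_all_reduce import CustomAllreduce
    return CustomAllreduce(group=group, device=device)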

@hubertlu-tw changed the title from "[AMD] Add AITER Custom All-reduce" to "[AMD] Add AITER Custom All-Reduce" on Nov 12, 2025
@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces support for AITER's custom all-reduce on AMD platforms, enabled via the SGLANG_AITER_AR environment variable. The changes are well-structured, including a dispatch mechanism to select the appropriate all-reduce implementation and a comprehensive benchmark script to compare performance. The tests have also been updated accordingly. My main feedback concerns a reduction in type safety in parallel_state.py where Optional[Any] is used. I've suggested an improvement to enhance maintainability.

In parallel_state.py:

-    self.ca_comm: Optional[CustomAllreduce] = None
+    self.ca_comm: Optional[Any] = None
@gemini-code-assist (Contributor) commented (severity: medium):

Changing the type hint for ca_comm from Optional[CustomAllreduce] to Optional[Any] reduces type safety and maintainability. While this works because both sglang.CustomAllreduce and aiter.CustomAllreduce are expected to have a compatible interface, it would be better to define a Protocol or an abstract base class that both communicators implement. This would make the code more robust and easier to understand.

For example, you could define a protocol:

from typing import Protocol, Union, Optional
from torch.distributed import ProcessGroup
import torch

class AllReduceCommunicator(Protocol):
    """Structural interface shared by both sglang's and aiter's CustomAllreduce."""

    def __init__(self, group: ProcessGroup, device: Union[int, str, torch.device], **kwargs):
        ...

    def custom_all_reduce(self, input: torch.Tensor) -> Optional[torch.Tensor]:
        ...

    def should_custom_ar(self, inp: torch.Tensor) -> bool:
        ...

    def close(self) -> None:
        ...

    # Add other common methods like capture if needed

# Then use it as:
self.ca_comm: Optional[AllReduceCommunicator] = None

This would enforce that any class assigned to ca_comm has the required methods, improving static analysis and preventing potential runtime errors.
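
As a hypothetical usage sketch (the function name and fallback path are illustrative, not from this PR), a call site could then be typed against the protocol rather than a concrete class:

import torch
import torch.distributed as dist

def all_reduce_with_fallback(comm: Optional[AllReduceCommunicator], inp: torch.Tensor) -> torch.Tensor:
    # Use the custom kernel only when the communicator accepts this input.
    if comm is not None and comm.should_custom_ar(inp):
        out = comm.custom_all_reduce(inp)
        if out is not None:
            return out
    # Otherwise fall back to the default torch.distributed all-reduce.
    dist.all_reduce(inp)
    return inp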

@HaiShaw added the run-ci label on Nov 12, 2025
@HaiShaw merged commit e4b2937 into sgl-project:main on Nov 13, 2025
76 of 79 checks passed