feat: integrate deepgemm into EPMoE #5805
TianQiLin666666 wants to merge 10 commits into sgl-project:main
Conversation
logger = logging.getLogger(__name__)
epmoe_use_deepgemm = get_bool_env_var("EPMOE_USE_DEEPGEMM")
We might import it directly.
So, do you mean we should just replace EPMOE_USE_DEEPGEMM with _ENABLE_JIT_DEEPGEMM?
Yes, enabling _ENABLE_JIT_DEEPGEMM will make DeepGEMM the default configuration for EPMoE.
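A minimal sketch of the flag plumbing discussed above. `get_bool_env_var` is reimplemented here for illustration, and the env-var name passed to it is a placeholder — the review suggests reusing the global `_ENABLE_JIT_DEEPGEMM` switch instead of a separate EPMoE-only knob:

```python
import os


def get_bool_env_var(name: str, default: str = "false") -> bool:
    """Parse an environment variable as a boolean ("1"/"true"/"yes" -> True)."""
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")


# Option discussed in the review: drop the extra EPMOE_USE_DEEPGEMM knob and
# derive the EPMoE behavior from a single JIT-DeepGEMM flag. The variable name
# below is an assumption for this sketch, not sglang's actual name.
_ENABLE_JIT_DEEPGEMM = get_bool_env_var("ENABLE_JIT_DEEPGEMM")
```

With one flag, EPMoE and the other DeepGEMM code paths cannot get out of sync, which is the clarity argument made later in this thread.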
def forward(self, hidden_states: torch.Tensor, router_logits: torch.Tensor):
    if use_deep_gemm and epmoe_use_deepgemm:
Why would EPMoE's DeepGEMM path be disabled when use_deep_gemm is already enabled?
Maybe forward_deepgemm should simply be called whenever use_deep_gemm is enabled.
Are there any cases where Triton GEMM in forward_normal outperforms DeepGEMM?
As of now, I haven't found any case where the Triton GEMM in forward_normal outperforms DeepGEMM, but DeepGEMM may occupy more GPU memory.
We could remove epmoe_use_deepgemm and the corresponding environment variable EPMOE_USE_DEEPGEMM for the sake of clarity.
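The dispatch under discussion can be sketched as a single-flag routing between the two forward paths. This is a dependency-free illustration, not sglang's actual EPMoE class: the method bodies are stubs, and the string tags exist only so the routing is observable.

```python
class EPMoE:
    """Sketch of EPMoE forward dispatch with one DeepGEMM flag (assumed design)."""

    def __init__(self, use_deep_gemm: bool = False):
        # Single switch, per the review suggestion to drop epmoe_use_deepgemm.
        self.use_deep_gemm = use_deep_gemm

    def forward(self, hidden_states, router_logits):
        # Route to the DeepGEMM path only when the flag is set; otherwise
        # fall back to the existing Triton grouped-GEMM path.
        if self.use_deep_gemm:
            return self.forward_deepgemm(hidden_states, router_logits)
        return self.forward_normal(hidden_states, router_logits)

    def forward_deepgemm(self, hidden_states, router_logits):
        # Would invoke DeepGEMM grouped-GEMM kernels here.
        return ("deepgemm", hidden_states)

    def forward_normal(self, hidden_states, router_logits):
        # Would invoke the existing Triton grouped-GEMM kernels here.
        return ("triton", hidden_states)
```

The trade-off noted above still applies: DeepGEMM wins on speed in the cases measured, but may use more GPU memory, which is why keeping the fallback path matters.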
@xutizhou Could you please help me merge this?
Sure, I need some time to review and test.
Hi @TianQiLin666666 Could you help fix the conflicts? Thanks!
def exp2_upper(num: int) -> int:
    for i in range(2, 31):
Why does the search start from 2**2 = 4?
Hi @TianQiLin666666 Thanks for the great work! Since this PR has not been updated for a while, and we are highly interested in this feature, I have asked @xutizhou to make some fixes and optimizations based on your work in a new PR #6821. We will add you as a co-author in the new PR. Thank you for your understanding and help.
Motivation
For normal EPMoE (without DeepEP), integrate DeepGEMM as an option.
Modifications
Add forward_deepgemm in EPMoE. Use the env var EPMOE_USE_DEEPGEMM to enable it.
Evaluation
Speed
With 2 nodes of 8×H20-96G (EP16), enabling EPMOE_USE_DEEPGEMM leads to a 14% throughput gain.
Accuracy
MMLU test with mmlu/bench_sglang.py
Checklist