
optimize get_topk_ragged by fusing get k and k_scale triton kernel #16043

Merged
Fridge003 merged 6 commits into sgl-project:main from BJWang-ant:opt-get-topk-ragged
Feb 4, 2026

Conversation

@BJWang-ant
Contributor

@BJWang-ant BJWang-ant commented Dec 29, 2025

  • Summary

Optimize the get_index_k_scale_buffer function to reduce the number of concatenation operations of k_fp8 and k_scale.
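
A rough PyTorch-level illustration of the idea (not the PR's Triton code; the buffer names, shapes, and the k_fp8/k_scale layout below are hypothetical): the unfused baseline gathers k and k_scale separately and pays for a torch.cat on every call, while the fused path writes both parts straight into one preallocated output, which is what a single Triton kernel can do in one launch.

    import torch

    def get_k_unfused(k_fp8_buf, k_scale_buf, token_indices):
        # Baseline: gather k and k_scale separately, then concatenate.
        # The concat is an extra full copy on every call.
        k = k_fp8_buf[token_indices]
        k_scale = k_scale_buf[token_indices]
        return torch.cat([k, k_scale], dim=-1)

    def get_k_fused(k_fp8_buf, k_scale_buf, token_indices, out):
        # Fused idea: one kernel (Triton in the PR) reads both buffers and
        # writes directly into a preallocated output, so no torch.cat is needed.
        # Emulated here with two in-place copies into slices of `out`.
        k_dim = k_fp8_buf.shape[-1]
        out[:, :k_dim] = k_fp8_buf[token_indices]
        out[:, k_dim:] = k_scale_buf[token_indices]
        return out

    # Hypothetical shapes: fp8 key bytes plus per-token scale bytes.
    k_fp8_buf = torch.randint(0, 255, (1024, 512), dtype=torch.uint8)
    k_scale_buf = torch.randint(0, 255, (1024, 4), dtype=torch.uint8)
    idx = torch.randint(0, 1024, (128,))
    out = torch.empty(128, 516, dtype=torch.uint8)
    assert torch.equal(get_k_fused(k_fp8_buf, k_scale_buf, idx, out),
                       get_k_unfused(k_fp8_buf, k_scale_buf, idx))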


@BJWang-ant BJWang-ant marked this pull request as draft December 29, 2025 06:27
@Fridge003 Fridge003 self-assigned this Dec 29, 2025
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch from 16212d6 to 8c3d453 on January 9, 2026 07:14
@BJWang-ant BJWang-ant changed the title from "[wip] optimize get_index_k_scale_buffer to remote concat k and k_scale" to "[wip] optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" on Jan 9, 2026
@BJWang-ant
Contributor Author

BJWang-ant commented Jan 9, 2026

The above commit splits the for loops for getting k/k_scale and ke/ks into two separate kernels. Judging solely from my test results and the trace, when ISL = 1600 + 3700 the processing time drops from about 420 us to approximately 21 us.

[trace screenshots: before and after the optimization]
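
For reference, one way to reproduce a per-call latency number like this without capturing a full trace is a simple CUDA-event timer. This is only a sketch; get_index_k_scale_buffer and its arguments here stand in for whatever callable is being measured.

    import torch

    def time_cuda_us(fn, iters=100, warmup=10):
        # Average per-call latency in microseconds, measured with CUDA events.
        for _ in range(warmup):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) * 1000.0 / iters  # ms -> us per call

    # Hypothetical usage:
    # print(time_cuda_us(lambda: get_index_k_scale_buffer(kv_buffer, page_table, seq_lens)))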

Commands:

  • PD separation: 2P1D. Prefill runs on B200, decode on H200.

  • Prefill server command:
    NVSHMEM_IB_GID_INDEX=3
    TORCH_CUDA_ARCH_LIST="9.0"
    NVSHMEM_IB_TRAFFIC_CLASS=184
    MC_SLICE_SIZE=262144
    GLOO_SOCKET_IFNAME=eth4
    NCCL_SOCKET_IFNAME=eth4
    MC_TE_METRIC=true
    SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
    SGLANG_DG_CACHE_DIR=xxxxx
    python3 -m sglang.launch_server
    --model-path /deepseek-ai/DeepSeek-V3.2
    --disaggregation-mode prefill
    --disaggregation-ib-device mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11
    --dist-init-addr $p0:5757
    --nnodes 1
    --node-rank $pi
    --enable-dp-attention
    --tp 8
    --dp 2
    --ep 8
    --enable-dp-attention
    --enable-metrics
    --enable-metrics-for-all-schedulers
    --enable-expert-distribution-metrics
    --decode-log-interval 1
    --moe-a2a-backend deepep
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --moe-dense-tp-size 1
    --enable-dp-lm-head
    --enable-cache-report
    --watchdog-timeout 1000000
    --deepep-mode normal
    --mem-fraction-static 0.85
    --max-running-requests 512
    --chunked-prefill-size 32768
    --max-prefill-tokens 32768
    --enable-eplb
    --eplb-rebalance-num-iterations 1000
    --ep-dispatch-algorithm dynamic
    --eplb-algorithm deepseek
    --expert-distribution-recorder-mode stat
    --deepep-config /configs/deepep.json
    --page-size 64
    --log-level debug
    --context-length 23000

  • Decode server command:
    GLOO_SOCKET_IFNAME=eth1
    GLOO_SOCKET_TIMEOUT_MS=60000
    TORCH_CUDA_ARCH_LIST="9.0"
    NCCL_SOCKET_IFNAME=eth1
    MC_TE_METRIC=true
    SGLANG_TBO_DEBUG=1
    NVSHMEM_HCA_LIST=mlx5_4,mlx5_5,mlx5_6,mlx5_7
    NVSHMEM_IB_GID_INDEX=3
    NVSHMEM_IB_ENABLE_IBGDA=true
    NVSHMEM_IB_TRAFFIC_CLASS=184
    NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY=AF_INET
    NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=eth1
    DEEPEP_DIAGNOSE_INTERVAL=120
    DEEPEP_DIAGNOSE_LOG_DETAILS=1
    SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
    SGLANG_DG_CACHE_DIR=xxxxxxx
    python3 -m sglang.launch_server
    --model-path /deepseek-ai/DeepSeek-V3.2
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
    --disaggregation-transfer-backend mooncake
    --dist-init-addr $d0:5757
    --nnodes 1
    --node-rank $di
    --tp 8
    --ep 8
    --dp 8
    --enable-dp-attention
    --decode-log-interval 1
    --enable-metrics
    --enable-metrics-for-all-schedulers
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --moe-dense-tp-size 1
    --enable-dp-lm-head
    --enable-cache-report
    --disaggregation-mode decode
    --watchdog-timeout 1000000
    --deepep-mode low_latency
    --mem-fraction-static 0.85
    --max-running-requests 256
    --context-length 23000
    --nsa-decode-backend 'flashmla_kv'
    --quantization 'fp8'
    --kv-cache-dtype 'fp8_e4m3'
    --cuda-graph-max-bs 32
    --moe-a2a-backend deepep
    --speculative-algorithm EAGLE
    --speculative-num-steps 3
    --speculative-eagle-topk 1
    --speculative-num-draft-tokens 4
    --enable-flashinfer-allreduce-fusion
    --speculative-attention-mode decode
    --prefill-round-robin-balance
    --page-size 64
    --log-level debug
    --load-balance-method round_robin

  • Benchmark command:
    python3 -m sglang.bench_serving --model /deepseek-ai/DeepSeek-V3.2 --base-url http://xxxx:xxxxx --dataset-name generated-shared-prefix --gsp-num-groups 2 --gsp-prompts-per-group 5000 --gsp-system-prompt-len 1600 --gsp-question-len 3700 --gsp-output-len 200 --warmup-requests 100 --max-concurrency 64 --dataset-path xxxx

@BJWang-ant BJWang-ant marked this pull request as ready for review January 9, 2026 09:24
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch from 5332824 to 7ed5427 on January 9, 2026 09:50
@BJWang-ant BJWang-ant changed the title from "[wip] optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" to "optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" on Jan 9, 2026
@BJWang-ant BJWang-ant changed the title from "optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" to "optimize get_index_k_scale_buffer by fuse get_k_scale into fuse kernel" on Jan 10, 2026
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch 4 times, most recently from 2b4b674 to 4507c0c, on January 11, 2026 05:06
@BJWang-ant
Contributor Author

BJWang-ant commented Jan 11, 2026

Please help review the code. Thanks!

@BJWang-ant BJWang-ant changed the title from "optimize get_index_k_scale_buffer by fuse get_k_scale into fuse kernel" to "optimize get_topk_ragged by fusing get k and k_scale triton kernel" on Jan 11, 2026
@Fridge003
Collaborator

@BJWang-ant Please fix the merge conflict.

@Fridge003
Collaborator

Fridge003 commented Jan 23, 2026

Please post the GPQA/AIME2025 results following the instructions here:
https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

@BJWang-ant
Contributor Author

@BJWang-ant Can you please post the GPQA result?

Also are you posting V3.2 and V3.2 speciale in reverse order? I feel the result for V3.2 speciale is not correct (speciale doesn't use tool call, and needs 120k output length)

If V3.2 (not speciale) reaches 93, I think this PR is correct, then we can ignore the result on speciale

I will test again, and post the GPQA result.

@Fridge003
Collaborator

Please fix lint with

pre-commit install
pre-commit run --all-files

@BJWang-ant
Contributor Author

@BJWang-ant Can you please post the GPQA result?

Also are you posting V3.2 and V3.2 speciale in reverse order? I feel the result for V3.2 speciale is not correct (speciale doesn't use tool call, and needs 120k output length)

If V3.2 (not speciale) reaches 93, I think this PR is correct, then we can ignore the result on speciale

I tested AIME2025 again and got the same result.
Here is the GPQA result:
[screenshot: GPQA results]

@Fridge003
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 3, 2026
@Fridge003 Fridge003 merged commit 760ae93 into sgl-project:main Feb 4, 2026
317 of 377 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
@Jacob0226

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens):
Triton Error [HIP]: Code: 1, Messsage: invalid argument
This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Server:

python3 -m sglang.launch_server \
    --model /data/huggingface/hub/deepseek-ai/DeepSeek-V3.2-Exp \
    --mem-fraction-static 0.7 --tp 8 --port 8552 --trust-remote-code \
    --disable-radix-cache --chunked-prefill-size 131072 \
    --nsa-prefill-backend tilelang --nsa-decode-backend tilelang

Client:

python3 -m sglang.bench_serving \
    --port  8552 \
    --backend sglang \
    --model /data/huggingface/hub/deepseek-ai/DeepSeek-V3.2-Exp  \
    --dataset-name random \
    --random-input 65537 \
    --random-output 600 \
    --random-range-ratio 1.0 \
    --num-prompts 1 \
    --max-concurrency 1
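
For what it's worth, a common way around this limit (a minimal sketch, not the PR's actual fix; the kernel, the [seq_num, max_seq_len, head_dim] layout, and all names below are hypothetical, and HEAD_DIM is assumed to be a power of two) is to tile the token axis so the second grid dimension becomes ceil(max_seq_len / BLOCK_N) instead of max_seq_len:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def copy_tokens_kernel(
        src_ptr, dst_ptr, seq_lens_ptr,
        stride_src_seq, stride_dst_seq,
        HEAD_DIM: tl.constexpr,   # assumed power of two
        BLOCK_N: tl.constexpr,    # tokens handled per program on the token axis
    ):
        seq_id = tl.program_id(0)
        tok_block = tl.program_id(1)
        seq_len = tl.load(seq_lens_ptr + seq_id)

        tok = tok_block * BLOCK_N + tl.arange(0, BLOCK_N)
        tok_mask = tok < seq_len
        offs_d = tl.arange(0, HEAD_DIM)

        src = src_ptr + seq_id * stride_src_seq + tok[:, None] * HEAD_DIM + offs_d[None, :]
        dst = dst_ptr + seq_id * stride_dst_seq + tok[:, None] * HEAD_DIM + offs_d[None, :]
        vals = tl.load(src, mask=tok_mask[:, None], other=0.0)
        tl.store(dst, vals, mask=tok_mask[:, None])

    def copy_tokens(src, dst, seq_lens, max_seq_len, block_n=128):
        # src/dst are assumed contiguous [seq_num, max_seq_len, head_dim] tensors.
        seq_num, _, head_dim = src.shape
        # Grid Y is now ceil(max_seq_len / block_n) instead of max_seq_len,
        # which stays under the Y-dimension launch limit for any realistic context.
        grid = (seq_num, triton.cdiv(max_seq_len, block_n))
        copy_tokens_kernel[grid](
            src, dst, seq_lens,
            src.stride(0), dst.stride(0),
            HEAD_DIM=head_dim, BLOCK_N=block_n,
        )

Alternatively, the long axis can be moved onto program_id(0), whose launch limit is far larger than the Y dimension's on both CUDA and HIP.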

@BJWang-ant
Contributor Author

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Yeah, we found this issue too. I will address it in the next PR.
[screenshot]

@BJWang-ant
Contributor Author

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

I will push the code either today or tomorrow.

@Jacob0226

@BJWang-ant Thanks for the update. The snapshot looks like you have another PR page. Could you share the PR link so I can run some tests once it’s merged?

@BJWang-ant
Contributor Author

BJWang-ant commented Feb 9, 2026

@BJWang-ant Thanks for the update. The snapshot looks like you have another PR page. Could you share the PR link so I can run some tests once it’s merged?

I'm not quite sure which PR you are referring to.

@Jacob0226

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Yeah, we found this issue too. I will address it in the next PR. [screenshot]

The snapshot looks like a new PR. If this is not the PR to fix the long context issue, please leave a comment when the new PR is ready. Thanks.

@Fridge003
Collaborator

Hi @BJWang-ant, I also found that this PR causes errors on some extremely long contexts, so we need to revert it temporarily. Please combine this PR with the fix in the next PR.

@BJWang-ant
Contributor Author

Hi @BJWang-ant, I also found that this PR causes errors on some extremely long contexts, so we need to revert it temporarily. Please combine this PR with the fix in the next PR.

OK. Could you please give me a failing case?

@Fridge003
Collaborator

It can easily be hit when the input length is 128k.

@xu-yfei
Contributor

xu-yfei commented Feb 9, 2026

@BJWang-ant The original code has a bug in the CP (context parallel) scenario. Could you please include this fix as well?
#18280

@BJWang-ant
Contributor Author

@BJWang-ant The original code has a bug in the CP (context parallel) scenario. Could you please include this fix as well? #18280

OK

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026