
optimize get_topk_ragged by fusing get k and k_scale triton kernel #16043

Merged
Fridge003 merged 6 commits into sgl-project:main from BJWang-ant:opt-get-topk-ragged
Feb 4, 2026

Conversation

@BJWang-ant
Contributor

@BJWang-ant BJWang-ant commented Dec 29, 2025

  • Summary

Optimize the get_index_k_scale_buffer function to reduce the number of concatenation operations of k_fp8 and k_scale.
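
A rough PyTorch-level illustration of the idea (not the PR's Triton code; the buffer names, shapes, and the k_fp8/k_scale layout below are hypothetical): the unfused baseline gathers k and k_scale separately and pays for a torch.cat on every call, while the fused path writes both parts straight into one preallocated output, which is what a single Triton kernel can do in one launch.

    import torch

    def get_k_unfused(k_fp8_buf, k_scale_buf, token_indices):
        # Baseline: gather k and k_scale separately, then concatenate.
        # The concat is an extra full copy on every call.
        k = k_fp8_buf[token_indices]
        k_scale = k_scale_buf[token_indices]
        return torch.cat([k, k_scale], dim=-1)

    def get_k_fused(k_fp8_buf, k_scale_buf, token_indices, out):
        # Fused idea: one kernel (Triton in the PR) reads both buffers and
        # writes directly into a preallocated output, so no torch.cat is needed.
        # Emulated here with two in-place copies into slices of `out`.
        k_dim = k_fp8_buf.shape[-1]
        out[:, :k_dim] = k_fp8_buf[token_indices]
        out[:, k_dim:] = k_scale_buf[token_indices]
        return out

    # Hypothetical shapes: fp8 key bytes plus per-token scale bytes.
    k_fp8_buf = torch.randint(0, 255, (1024, 512), dtype=torch.uint8)
    k_scale_buf = torch.randint(0, 255, (1024, 4), dtype=torch.uint8)
    idx = torch.randint(0, 1024, (128,))
    out = torch.empty(128, 516, dtype=torch.uint8)
    assert torch.equal(get_k_fused(k_fp8_buf, k_scale_buf, idx, out),
                       get_k_unfused(k_fp8_buf, k_scale_buf, idx))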


@BJWang-ant BJWang-ant marked this pull request as draft December 29, 2025 06:27
@Fridge003 Fridge003 self-assigned this Dec 29, 2025
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch from 16212d6 to 8c3d453 on January 9, 2026 07:14
@BJWang-ant BJWang-ant changed the title from "[wip] optimize get_index_k_scale_buffer to remote concat k and k_scale" to "[wip] optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" on Jan 9, 2026
@BJWang-ant
Contributor Author

BJWang-ant commented Jan 9, 2026

The above commit splits the for loops for getting k/k_scale and ke/ks into two separate kernels. Judging solely from my test results and the trace, when ISL = 1600 + 3700 the processing time drops from about 420 us to approximately 21 us.

[trace screenshots: before and after the optimization]
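
For reference, one way to reproduce a per-call latency number like this without capturing a full trace is a simple CUDA-event timer. This is only a sketch; get_index_k_scale_buffer and its arguments here stand in for whatever callable is being measured.

    import torch

    def time_cuda_us(fn, iters=100, warmup=10):
        # Average per-call latency in microseconds, measured with CUDA events.
        for _ in range(warmup):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) * 1000.0 / iters  # ms -> us per call

    # Hypothetical usage:
    # print(time_cuda_us(lambda: get_index_k_scale_buffer(kv_buffer, page_table, seq_lens)))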

Commands:

  • PD separation: 2P1D. Prefill runs on B200, decode on H200.

  • Prefill server command:
    NVSHMEM_IB_GID_INDEX=3
    TORCH_CUDA_ARCH_LIST="9.0"
    NVSHMEM_IB_TRAFFIC_CLASS=184
    MC_SLICE_SIZE=262144
    GLOO_SOCKET_IFNAME=eth4
    NCCL_SOCKET_IFNAME=eth4
    MC_TE_METRIC=true
    SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
    SGLANG_DG_CACHE_DIR=xxxxx
    python3 -m sglang.launch_server
    --model-path /deepseek-ai/DeepSeek-V3.2
    --disaggregation-mode prefill
    --disaggregation-ib-device mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11
    --dist-init-addr $p0:5757
    --nnodes 1
    --node-rank $pi
    --enable-dp-attention
    --tp 8
    --dp 2
    --ep 8
    --enable-dp-attention
    --enable-metrics
    --enable-metrics-for-all-schedulers
    --enable-expert-distribution-metrics
    --decode-log-interval 1
    --moe-a2a-backend deepep
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --moe-dense-tp-size 1
    --enable-dp-lm-head
    --enable-cache-report
    --watchdog-timeout 1000000
    --deepep-mode normal
    --mem-fraction-static 0.85
    --max-running-requests 512
    --chunked-prefill-size 32768
    --max-prefill-tokens 32768
    --enable-eplb
    --eplb-rebalance-num-iterations 1000
    --ep-dispatch-algorithm dynamic
    --eplb-algorithm deepseek
    --expert-distribution-recorder-mode stat
    --deepep-config /configs/deepep.json
    --page-size 64
    --log-level debug
    --context-length 23000

  • Decode server command:
    GLOO_SOCKET_IFNAME=eth1
    GLOO_SOCKET_TIMEOUT_MS=60000
    TORCH_CUDA_ARCH_LIST="9.0"
    NCCL_SOCKET_IFNAME=eth1
    MC_TE_METRIC=true
    SGLANG_TBO_DEBUG=1
    NVSHMEM_HCA_LIST=mlx5_4,mlx5_5,mlx5_6,mlx5_7
    NVSHMEM_IB_GID_INDEX=3
    NVSHMEM_IB_ENABLE_IBGDA=true
    NVSHMEM_IB_TRAFFIC_CLASS=184
    NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY=AF_INET
    NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=eth1
    DEEPEP_DIAGNOSE_INTERVAL=120
    DEEPEP_DIAGNOSE_LOG_DETAILS=1
    SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
    SGLANG_DG_CACHE_DIR=xxxxxxx
    python3 -m sglang.launch_server
    --model-path /deepseek-ai/DeepSeek-V3.2
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
    --disaggregation-transfer-backend mooncake
    --dist-init-addr $d0:5757
    --nnodes 1
    --node-rank $di
    --tp 8
    --ep 8
    --dp 8
    --enable-dp-attention
    --decode-log-interval 1
    --enable-metrics
    --enable-metrics-for-all-schedulers
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --moe-dense-tp-size 1
    --enable-dp-lm-head
    --enable-cache-report
    --disaggregation-mode decode
    --watchdog-timeout 1000000
    --deepep-mode low_latency
    --mem-fraction-static 0.85
    --max-running-requests 256
    --context-length 23000
    --nsa-decode-backend 'flashmla_kv'
    --quantization 'fp8'
    --kv-cache-dtype 'fp8_e4m3'
    --cuda-graph-max-bs 32
    --moe-a2a-backend deepep
    --speculative-algorithm EAGLE
    --speculative-num-steps 3
    --speculative-eagle-topk 1
    --speculative-num-draft-tokens 4
    --enable-flashinfer-allreduce-fusion
    --speculative-attention-mode decode
    --prefill-round-robin-balance
    --page-size 64
    --log-level debug
    --load-balance-method round_robin

  • Benchmark command:
    python3 -m sglang.bench_serving --model /deepseek-ai/DeepSeek-V3.2 --base-url http://xxxx:xxxxx --dataset-name generated-shared-prefix --gsp-num-groups 2 --gsp-prompts-per-group 5000 --gsp-system-prompt-len 1600 --gsp-question-len 3700 --gsp-output-len 200 --warmup-requests 100 --max-concurrency 64 --dataset-path xxxx

@BJWang-ant BJWang-ant marked this pull request as ready for review January 9, 2026 09:24
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch from 5332824 to 7ed5427 on January 9, 2026 09:50
@BJWang-ant BJWang-ant changed the title from "[wip] optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" to "optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" on Jan 9, 2026
@BJWang-ant BJWang-ant changed the title from "optimize get_index_k_scale_buffer by fuse get_k_scale and get_ke_ks two fuse kernel" to "optimize get_index_k_scale_buffer by fuse get_k_scale into fuse kernel" on Jan 10, 2026
@BJWang-ant BJWang-ant force-pushed the opt-get-topk-ragged branch 4 times, most recently from 2b4b674 to 4507c0c, on January 11, 2026 05:06
@BJWang-ant
Contributor Author

BJWang-ant commented Jan 11, 2026

Please help review the code. Thanks!

@BJWang-ant BJWang-ant changed the title from "optimize get_index_k_scale_buffer by fuse get_k_scale into fuse kernel" to "optimize get_topk_ragged by fusing get k and k_scale triton kernel" on Jan 11, 2026
@Fridge003
Collaborator

@BJWang-ant Please fix the merge conflict.

@Fridge003
Collaborator

Fridge003 commented Jan 23, 2026

Please post the GPQA/AIME2025 results following the instructions here:
https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

@BJWang-ant
Contributor Author

@BJWang-ant Can you please post the GPQA result?

Also are you posting V3.2 and V3.2 speciale in reverse order? I feel the result for V3.2 speciale is not correct (speciale doesn't use tool call, and needs 120k output length)

If V3.2 (not speciale) reaches 93, I think this PR is correct, then we can ignore the result on speciale

I will test again, and post the GPQA result.

@Fridge003
Collaborator

Please fix lint with

pre-commit install
pre-commit run --all-files

@BJWang-ant
Contributor Author

@BJWang-ant Can you please post the GPQA result?

Also are you posting V3.2 and V3.2 speciale in reverse order? I feel the result for V3.2 speciale is not correct (speciale doesn't use tool call, and needs 120k output length)

If V3.2 (not speciale) reaches 93, I think this PR is correct, then we can ignore the result on speciale

I tested AIME2025 again and got the same result.
Here is the GPQA result:
[screenshot: GPQA results]

@Fridge003
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 3, 2026
@Fridge003 Fridge003 merged commit 760ae93 into sgl-project:main Feb 4, 2026
317 of 377 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
@Jacob0226

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens):
Triton Error [HIP]: Code: 1, Messsage: invalid argument
This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Server:

python3 -m sglang.launch_server \
    --model /data/huggingface/hub/deepseek-ai/DeepSeek-V3.2-Exp \
    --mem-fraction-static 0.7 --tp 8 --port 8552 --trust-remote-code \
    --disable-radix-cache --chunked-prefill-size 131072 \
    --nsa-prefill-backend tilelang --nsa-decode-backend tilelang

Client:

python3 -m sglang.bench_serving \
    --port  8552 \
    --backend sglang \
    --model /data/huggingface/hub/deepseek-ai/DeepSeek-V3.2-Exp  \
    --dataset-name random \
    --random-input 65537 \
    --random-output 600 \
    --random-range-ratio 1.0 \
    --num-prompts 1 \
    --max-concurrency 1
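
For what it's worth, a common way around this limit (a minimal sketch, not the PR's actual fix; the kernel, the [seq_num, max_seq_len, head_dim] layout, and all names below are hypothetical, and HEAD_DIM is assumed to be a power of two) is to tile the token axis so the second grid dimension becomes ceil(max_seq_len / BLOCK_N) instead of max_seq_len:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def copy_tokens_kernel(
        src_ptr, dst_ptr, seq_lens_ptr,
        stride_src_seq, stride_dst_seq,
        HEAD_DIM: tl.constexpr,   # assumed power of two
        BLOCK_N: tl.constexpr,    # tokens handled per program on the token axis
    ):
        seq_id = tl.program_id(0)
        tok_block = tl.program_id(1)
        seq_len = tl.load(seq_lens_ptr + seq_id)

        tok = tok_block * BLOCK_N + tl.arange(0, BLOCK_N)
        tok_mask = tok < seq_len
        offs_d = tl.arange(0, HEAD_DIM)

        src = src_ptr + seq_id * stride_src_seq + tok[:, None] * HEAD_DIM + offs_d[None, :]
        dst = dst_ptr + seq_id * stride_dst_seq + tok[:, None] * HEAD_DIM + offs_d[None, :]
        vals = tl.load(src, mask=tok_mask[:, None], other=0.0)
        tl.store(dst, vals, mask=tok_mask[:, None])

    def copy_tokens(src, dst, seq_lens, max_seq_len, block_n=128):
        # src/dst are assumed contiguous [seq_num, max_seq_len, head_dim] tensors.
        seq_num, _, head_dim = src.shape
        # Grid Y is now ceil(max_seq_len / block_n) instead of max_seq_len,
        # which stays under the Y-dimension launch limit for any realistic context.
        grid = (seq_num, triton.cdiv(max_seq_len, block_n))
        copy_tokens_kernel[grid](
            src, dst, seq_lens,
            src.stride(0), dst.stride(0),
            HEAD_DIM=head_dim, BLOCK_N=block_n,
        )

Alternatively, the long axis can be moved onto program_id(0), whose launch limit is far larger than the Y dimension's on both CUDA and HIP.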

@BJWang-ant
Contributor Author

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Yeah, we found this issue too. I will address it in the next PR.
[screenshot]

@BJWang-ant
Contributor Author

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

I will push the code either today or tomorrow.

@Jacob0226

@BJWang-ant Thanks for the update. The snapshot looks like you have another PR page. Could you share the PR link so I can run some tests once it’s merged?

@BJWang-ant
Contributor Author

BJWang-ant commented Feb 9, 2026

@BJWang-ant Thanks for the update. The snapshot looks like you have another PR page. Could you share the PR link so I can run some tests once it’s merged?

I'm not quite sure which PR you are referring to.

@Jacob0226

Hi, after this PR was merged, DeepSeek-v3.2 encounters the following error when running with a long context (> 65,536 tokens): Triton Error [HIP]: Code: 1, Messsage: invalid argument This occurs because the Triton kernel grid is configured as grid = (seq_num, max_seq_len). When max_seq_len exceeds 65,536, the kernel launch fails since HIP does not support launching more than 65,536 blocks in the Y dimension. I suspect that CUDA GPUs may also have the same limitation and may not support launching more than 65,536 blocks in the Y dimension either.

Yeah, we found this issue too. I will address it in the next PR. [screenshot]

The snapshot looks like a new PR. If this is not the PR to fix the long context issue, please leave a comment when the new PR is ready. Thanks.

@Fridge003
Collaborator

Hi @BJWang-ant, I also found that this PR causes errors on some extremely long contexts, so we need to revert it temporarily. Please combine this PR with the fix in the next PR.

@BJWang-ant
Contributor Author

Hi @BJWang-ant, I also found that this PR causes errors on some extremely long contexts, so we need to revert it temporarily. Please combine this PR with the fix in the next PR.

OK. Could you please give me a failing case?

@Fridge003
Collaborator

It can easily be hit when the input length is 128k.

@xu-yfei
Contributor

xu-yfei commented Feb 9, 2026

@BJWang-ant The original code has a bug in the CP (context parallel) scenario. Could you please include this fix as well?
#18280

@BJWang-ant
Contributor Author

@BJWang-ant The original code has a bug in the CP (context parallel) scenario. Could you please include this fix as well? #18280

OK

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026