[DeepSeek V3.2] Enable trtllm NSA with bf16 kvcache #16758
Merged
Fridge003 merged 4 commits into sgl-project:main on Jan 23, 2026
|
@Fridge003 Could you please take a look? I only just now saw #15546, seems like the changes are similar though here I am using |
Fridge003
previously requested changes
Jan 19, 2026
Collaborator
|
@akhilg-nv Please fix the conflict |
393f224 to
0c2da34
Compare
Collaborator
|
@akhilg-nv Please fix lint with |
This was referenced Jan 20, 2026
959b851 to
12224d4
Compare
Collaborator
|
/tag-and-rerun-ci |
Collaborator
|
/rerun-failed-ci |
hlu1
reviewed
Jan 21, 2026
5622465 to
03ac8f3
Compare
03ac8f3 to
2162e7e
Compare
hlu1
approved these changes
Jan 23, 2026
hlu1
reviewed
Jan 23, 2026
2162e7e to
306072d
Compare
Collaborator
|
/tag-and-rerun-ci |
Contributor
|
The PR mentions speculative decoding testing - I tried but got: It looks like the extend path wasn’t updated for Edit: Opened a quick follow‑up PR (#17662) to address this. |
5 tasks
Johnsonms
pushed a commit
to Johnsonms/sglang
that referenced
this pull request
Feb 14, 2026
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Motivation
Enables the NSA backend to use trtllm kernels for sparse attention. This can be more efficient than FlashMLA when the head size isn't a multiple of 64 and therefore requires padding. This PR enables the trtllm path with a BF16 KVCache; FP8 support will follow in another PR.
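To make the padding cost concrete, here is a minimal sketch of the arithmetic. It assumes the padding rule is "round the head size up to the next multiple of 64", as the motivation describes; the function names and the example sizes are hypothetical and are not taken from the PR or from either kernel library.

```python
def padded_size(size: int, multiple: int = 64) -> int:
    """Round size up to the next multiple (assumed padding rule)."""
    if size <= 0 or multiple <= 0:
        raise ValueError("size and multiple must be positive")
    return -(-size // multiple) * multiple  # ceiling division


def padding_overhead(size: int, multiple: int = 64) -> float:
    """Fraction of the padded size that is padding rather than real data."""
    padded = padded_size(size, multiple)
    return (padded - size) / padded


# Hypothetical head sizes, chosen only to illustrate the effect.
for size in (16, 64, 96, 128):
    print(size, padded_size(size), f"{padding_overhead(size):.0%}")
```

Sizes that are already multiples of 64 incur no overhead; the further a size falls below the next multiple, the larger the share of work spent on padding.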
Modifications
This change interfaces with the new kernel added in flashinfer to use the trtllm kernel for decode in the NSA backend. server_args.py is also modified to enable the new option.
Accuracy Tests
Testing MTP with:
--speculative-algorithm EAGLE --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3
GPQA:
LongBenchV2:
python -m sglang.test.run_eval \
  --eval-name longbench_v2 \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V3.2 \
  --max-context-length 128000 \
  --max-tokens 16384 \
  --num-threads 16
Benchmarking and Profiling
Trace shows the new kernel call (about 14 microseconds):
Trace with the FlashMLA path with bf16 KVCache shows the FlashMLA sparse kernel is slower (about 26.5 microseconds):
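As a rough sketch of what the two trace figures imply, the snippet below computes the per-call kernel speedup from the approximate durations quoted above (about 14 µs for the trtllm kernel, about 26.5 µs for the FlashMLA sparse kernel). The helper names are hypothetical, and the result applies only to these single kernel calls, not to end-to-end serving.

```python
TRTLLM_KERNEL_US = 14.0    # approximate duration read from the trtllm trace
FLASHMLA_KERNEL_US = 26.5  # approximate duration read from the FlashMLA trace


def speedup(baseline_us: float, new_us: float) -> float:
    """How many times faster the new kernel is than the baseline."""
    if baseline_us <= 0 or new_us <= 0:
        raise ValueError("durations must be positive")
    return baseline_us / new_us


def time_saved_fraction(baseline_us: float, new_us: float) -> float:
    """Share of the baseline kernel time that the new kernel avoids."""
    if baseline_us <= 0 or new_us <= 0:
        raise ValueError("durations must be positive")
    return 1.0 - new_us / baseline_us


print(f"kernel speedup: {speedup(FLASHMLA_KERNEL_US, TRTLLM_KERNEL_US):.2f}x")
print(f"time saved per call: {time_saved_fraction(FLASHMLA_KERNEL_US, TRTLLM_KERNEL_US):.0%}")
```

Because attention is only one part of each decode step, a kernel-level gain of this size does not necessarily produce a comparable change in serving metrics.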
Comparing FlashMLA sparse and TRTLLM (both with bf16 kvcache) shows similar average results for decode.
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --warmup-requests 5 --max-concurrency 16 --num-prompts 80 --random-range-ratio 0.8 --random-input-len 4096 --random-output-len 4096 --profile --profile-by-stage
FlashMLA Sparse (BF16 KVCache)
TRTLLM Sparse NSA
Unit benchmarking shows higher bandwidth for the new trtllm kernel than for FlashMLA.
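For readers who want to reproduce a bandwidth figure from their own kernel timings, a minimal sketch of the usual calculation follows: estimate the KV-cache bytes a sparse decode call reads, then divide by the elapsed time. The function names, the shape values, and the timing are all hypothetical placeholders; they are not the PR's benchmark configuration or its results.

```python
def sparse_kv_bytes(batch: int, topk: int, kv_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache read when each request attends to topk selected tokens.

    bytes_per_elem defaults to 2, the size of one bf16 element.
    """
    return batch * topk * kv_dim * bytes_per_elem


def achieved_bandwidth_gb_s(bytes_moved: int, elapsed_s: float) -> float:
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return bytes_moved / elapsed_s / 1e9


# Hypothetical shape and timing, for illustration only.
kv_bytes = sparse_kv_bytes(batch=16, topk=2048, kv_dim=576)
print(f"{achieved_bandwidth_gb_s(kv_bytes, elapsed_s=20e-6):.1f} GB/s")
```

This counts only KV-cache reads; a fuller estimate would also include query, index, and output traffic, so it is best used to compare two kernels on identical shapes.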