[Sparse & HICache]: Enables hierarchical sparse KV cache management and scheduling for DeepSeek V32.#14619
Conversation
- Introduced `enable_hierarchical_nsa` flag in `ServerArgs` for enabling hierarchical NSA.
- Added `NSAHybridTokenToKVPoolAllocator` for managing separate allocations for the KV cache and indexer_k.
- Updated memory management and allocation functions to handle hierarchical NSA.
- Modified relevant classes and methods to support the new indexer_k functionality, including `NSAReqToTokenPool` and changes in `ScheduleBatch`.
- Adjusted allocation logic in `alloc_for_extend` and `alloc_for_decode` to accommodate indexer_k handling.
- Enhanced memory checking in `SchedulerRuntimeCheckerMixin` for hierarchical NSA.
- Updated documentation and comments for clarity on the new features and changes.
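To illustrate the idea behind a hybrid allocator that pairs KV-cache slots with a separate indexer_k pool, here is a minimal sketch. All names (`HybridPoolAllocator`, `alloc`, `free`) are hypothetical and simplified; the actual `NSAHybridTokenToKVPoolAllocator` in SGLang is more involved.

```python
# Hypothetical sketch (not the actual SGLang implementation): a hybrid
# allocator that hands out paired slot indices from two separate pools,
# one for the main KV cache and one for indexer_k entries, so both can
# be allocated and freed together per token.
class HybridPoolAllocator:
    def __init__(self, num_slots: int):
        self.free_kv = list(range(num_slots))       # free KV-cache slots
        self.free_indexer = list(range(num_slots))  # free indexer_k slots

    def alloc(self, n: int):
        """Return n (kv_slot, indexer_slot) pairs, or None if either pool is exhausted."""
        if n > len(self.free_kv) or n > len(self.free_indexer):
            return None
        kv = [self.free_kv.pop() for _ in range(n)]
        idx = [self.free_indexer.pop() for _ in range(n)]
        return list(zip(kv, idx))

    def free(self, pairs):
        """Return previously allocated pairs to both pools."""
        for kv_slot, idx_slot in pairs:
            self.free_kv.append(kv_slot)
            self.free_indexer.append(idx_slot)


allocator = HybridPoolAllocator(8)
pairs = allocator.alloc(3)   # three (kv, indexer_k) slot pairs
allocator.free(pairs)        # both pools are whole again
```

The point of pairing the pools is that an out-of-memory condition in either pool fails the whole allocation, which mirrors why the scheduler's memory checks must consider both.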
Commits:
- Sparse framework 1128
- sparse diff triton kernel optimized
- Sparse framework 1201 (huangtingwei9988)
- reduce cpu launch overhead
May I ask when PD separation will be supported? I can't wait! @hzh0425

We will support it within one week.
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>
(cherry picked from commit a89e85e)
In python/sglang/srt/layers/attention/nsa_backend.py:

```python
class NativeSparseAttnBackend(...):
    # ...
    def init_forward_metadata_capture_cuda_graph(self, ...):
        # ...
        if self.enable_nsa_hybrid_indexer_pool:
            indexer_page_table_1 = self.decode_cuda_graph_metadata[
                "indexer_page_table"
            ][:bs, :]
            indexer_real_page_table = self._transform_table_1_to_real(
                indexer_page_table_1
            )
        else:
            indexer_real_page_table = None
```

Additionally, I found that the draft decode path related to MTP is incomplete. If permitted, I can submit a PR to complete it.
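The `_transform_table_1_to_real` helper is not shown in this PR excerpt. One plausible interpretation, sketched here with assumed semantics (this is a guess, not the PR's code), is converting a token-level page table built with page size 1 into a page-level table for the real page size:

```python
# Assumed semantics (not taken from the PR): "table_1" maps each token
# slot of a request to a token index in a pool with page size 1; the
# "real" table maps each page slot to a page index for the actual page
# size. Sampling the first token slot of each page and integer-dividing
# by page_size yields the page index.
def transform_table_1_to_real(table_1, page_size):
    return [[tok // page_size for tok in row[::page_size]] for row in table_1]


# One request whose 8 token slots occupy pool tokens 0..7, page size 4:
print(transform_table_1_to_real([list(range(8))], 4))  # [[0, 1]]
```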
Hi, may I ask if there is a plan to support DS3.2 L2 HiCache in PD unified mode?

Yes, we are working on this feature and expect to release it as soon as possible.
Thank you very much for the authors' efforts. During my tests, I found that when GPU memory is limited (e.g., H20 with 96 GiB), the maximum number of concurrent decode requests is still very low. Would you be interested in discussing this? It might lead to improvements. Here is the decode log:

# before
[2026-01-13 09:58:14 DP5 TP5 EP5] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.18, #queue-req: 0,
[2026-01-13 09:58:15 DP2 TP2 EP2] Decode batch, #running-req: 4, #token: 131776, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.30, #queue-req: 0,
[2026-01-13 09:58:16 DP4 TP4 EP4] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 27, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.19, #queue-req: 0,
[2026-01-13 09:58:16 DP3 TP3 EP3] Decode batch, #running-req: 4, #token: 132416, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 29, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.44, #queue-req: 0,
[2026-01-13 09:58:16 DP1 TP1 EP1] Decode batch, #running-req: 4, #token: 132032, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.14, #queue-req: 0,
[2026-01-13 09:58:17 DP0 TP0 EP0] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.79, #queue-req: 0,
[2026-01-13 09:58:17 DP6 TP6 EP6] Decode batch, #running-req: 4, #token: 132224, token usage: 0.96, accept len: 1.98, accept rate: 0.99, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 114.34, #queue-req: 0,
[2026-01-13 09:58:17 DP7 TP7 EP7] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.89, #queue-req: 0,
# after
[2026-01-13 12:32:06 DP7 TP7 EP7] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.90, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.53, #queue-req: 0,
[2026-01-13 12:32:07 DP0 TP0 EP0] Decode batch, #running-req: 30, #token: 100288, token usage: 0.73, accept len: 1.71, accept rate: 0.86, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 375.49, #queue-req: 0,
[2026-01-13 12:32:07 DP4 TP4 EP4] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.89, accept rate: 0.94, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 416.05, #queue-req: 0,
[2026-01-13 12:32:07 DP1 TP1 EP1] Decode batch, #running-req: 27, #token: 99968, token usage: 0.73, accept len: 1.86, accept rate: 0.93, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 389.93, #queue-req: 0,
[2026-01-13 12:32:07 DP3 TP3 EP3] Decode batch, #running-req: 29, #token: 100160, token usage: 0.73, accept len: 1.83, accept rate: 0.92, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.83, #queue-req: 0,
[2026-01-13 12:32:10 DP5 TP5 EP5] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.89, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 393.26, #queue-req: 0,
[2026-01-13 12:32:10 DP2 TP2 EP2] Decode batch, #running-req: 29, #token: 100224, token usage: 0.73, accept len: 1.90, accept rate: 0.95, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 417.73, #queue-req: 0,
[2026-01-13 12:32:11 DP6 TP6 EP6] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.62, accept rate: 0.81, pre-allocated usage: 0.48, #prealloc-req: 0, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 339.64, #queue-req: 0, |
Thanks! How can I contact you on Slack? @yuyu5333 |
Motivation
Previously, Sparse Attention (DSA) was introduced in the DeepSeek V3.2 model; it selects only the top-2048 tokens to participate in the attention computation, significantly improving inference performance in long-context scenarios.
However, the model still requires the full KV cache to be held in GPU memory, so the storage footprint of the KV cache is not reduced. This limits the number of concurrent long-context requests the system can handle.
Therefore, we aim to integrate HICache to implement a hierarchical DSA: the full KV cache is stored in CPU memory or a remote store, while only 2,048 tokens per request are kept in GPU memory. This significantly increases the batch size during the decode phase and improves overall decode throughput.
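A back-of-envelope calculation (with illustrative numbers, not measurements) shows why keeping only 2,048 tokens per request on the GPU raises decode concurrency: under a fixed GPU token budget, the number of resident requests scales inversely with the per-request footprint.

```python
# Illustrative numbers only (not from the benchmark): assume a GPU KV
# budget of 512K tokens and a 128K-token context per request.
def max_concurrent_requests(gpu_budget_tokens, tokens_per_req):
    return gpu_budget_tokens // tokens_per_req

BUDGET = 512 * 1024        # assumed GPU KV budget, in tokens
FULL_CTX = 128 * 1024      # full KV cache resident on GPU
SPARSE_CTX = 2048          # only the top-2048 tokens resident on GPU

print(max_concurrent_requests(BUDGET, FULL_CTX))    # 4
print(max_concurrent_requests(BUDGET, SPARSE_CTX))  # 256
```

With these assumed numbers the per-request GPU footprint shrinks 64x, which is consistent in spirit with the "before" logs above showing decode stuck at only 4 running requests when the full KV cache must stay on the GPU.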
CoAuthors: @huangtingwei9988 @xiezhq-hermann @LingYeAI and hicache team
Upstream Issue: #12826
This is the upstream branch, and we will split it into multiple smaller PRs to move forward.
Todo PR:
Modifications
We have integrated HICache to build a unified hierarchical sparse framework that supports various sparse attention algorithms—such as Quest and ClusterKV—for hierarchical, sparsity-aware KV cache management.
Implementation of Hierarchical DSA:
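The unified framework described above is meant to host multiple sparse-attention algorithms (e.g., Quest, ClusterKV) behind one cache-management layer. A hypothetical sketch of what a pluggable page-selection policy could look like follows; every name here is illustrative, not SGLang's actual interface.

```python
# Hypothetical sketch of a pluggable sparse-selection hook: given a
# relevance score per cached page and a GPU residency budget, a policy
# decides which pages to keep (or fetch back) on the GPU while the rest
# live in the hierarchical cache (CPU / remote store).
from abc import ABC, abstractmethod


class SparsePolicy(ABC):
    @abstractmethod
    def select_pages(self, page_scores, budget):
        """Return sorted indices of pages to keep resident on GPU."""


class TopKPolicy(SparsePolicy):
    """Quest-style selection: keep the highest-scoring pages."""

    def select_pages(self, page_scores, budget):
        order = sorted(range(len(page_scores)), key=lambda i: -page_scores[i])
        return sorted(order[:budget])


policy = TopKPolicy()
print(policy.select_pages([0.1, 0.9, 0.5, 0.7], budget=2))  # [1, 3]
```

The design choice this sketch illustrates is that the cache-management layer only needs page indices back from the policy, so DSA's top-2048 token selection, Quest's per-page bounds, or ClusterKV's cluster scores can all plug into the same hierarchical fetch path.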
Accuracy Tests
We evaluated on several LongBench tasks, and the accuracy is nearly identical to that of the original DSA version.

Benchmarking and Profiling
Comparison Setup:
Checklist