
[Sparse & HICache]: Enables hierarchical sparse KV cache management and scheduling for DeepSeek V3.2 #14619

Open
hzh0425 wants to merge 41 commits into main from hicache/sparse

Conversation

Collaborator

@hzh0425 hzh0425 commented Dec 8, 2025

Motivation

Previously, Sparse Attention was introduced for DeepSeek V3.2, which selects only the top-2048 tokens to participate in the attention computation, significantly improving inference performance in long-context scenarios.

However, the model still requires caching the full KV cache in GPU memory, so the storage footprint of the KV cache is not reduced. This limits the number of concurrent long-context requests the system can handle.

Therefore, we aim to integrate HICache to implement a hierarchical DSA: the full KV cache is stored on the CPU or in a remote store, and only 2,048 tokens per request are kept in GPU memory. This significantly increases the batch size during the decode phase and improves overall decode throughput.

Co-authors: @huangtingwei9988 @xiezhq-hermann @LingYeAI and the HICache team
Upstream issue: #12826
This is the upstream branch; we will split it into multiple smaller PRs to move forward.
Todo PR:

Modifications

  1. We have integrated HICache to build a unified hierarchical sparse framework that supports various sparse attention algorithms—such as Quest and ClusterKV—for hierarchical, sparsity-aware KV cache management.

  2. Implementation of Hierarchical DSA:

    • Decoupled the DSA indexer from KV cache allocation and the req_to_token_pool mechanism: the full indexer remains on the GPU.
    • During the decode phase, each request is allocated a fixed token space of 2,048 tokens in GPU memory.
    • Before each attention layer begins, a diff kernel identifies the missing KV cache entries in GPU memory, and HICache incrementally fetches only those missing portions from CPU memory. Our observations show that, on average, only about 20% of the KV cache needs to be fetched incrementally per layer.
[image]
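The diff step above can be sketched as a set difference. This is a minimal NumPy sketch of the idea only, not the actual Triton kernel from this PR (which operates on page tables, not raw id arrays); all names here are illustrative.

```python
import numpy as np

def diff_missing_tokens(selected_ids: np.ndarray, resident_ids: np.ndarray) -> np.ndarray:
    """Return the token ids selected by the indexer that are not yet resident on GPU.

    Hypothetical sketch: np.setdiff1d yields the sorted unique values in
    selected_ids that are absent from resident_ids -- exactly the entries
    HICache must fetch incrementally from host memory before the layer runs.
    """
    return np.setdiff1d(selected_ids, resident_ids)

# Example: the indexer picks 6 tokens, 4 of which are already on GPU,
# so only the remaining 2 need an incremental fetch.
selected = np.array([3, 7, 8, 15, 21, 42])
resident = np.array([3, 7, 15, 42])
missing = diff_missing_tokens(selected, resident)
```

In practice the fraction of missing entries is what drives the I/O cost; the ~20% figure reported above is the average of this ratio per layer.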

Accuracy Tests

We evaluated several LongBench tests, and the accuracy is nearly identical to that of the original DSA version.
[image]

Benchmarking and Profiling

  1. Comparison Setup:

    • Model: dsv32 on H20 (140 GB)
    • Configuration: 8 DP, radix tree and overlap scheduling disabled; average sequence length: 16K
    • Native Mode: Maximum supported batch size ≈ (249,472 / 16,384) × 8 ≈ 120
    • Hierarchical Mode: Maximum supported batch size ≈ (249,472 / 2,048) × 8 ≈ 900; Host memory usage per rank: ~72 GB
    • Decode throughput comparison is as follows. (The current performance is not yet optimal; we will continue refining and optimizing, using computation-communication overlap to hide the latency introduced by the additional I/O.)
[image]
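The batch-size estimates above can be reproduced with a few lines of arithmetic, assuming the per-rank KV-cache budget of 249,472 tokens and 8 DP ranks stated in the setup:

```python
# Rough capacity arithmetic for the comparison setup above.
# Assumes a per-rank KV-cache token budget of 249,472 and 8 DP ranks,
# as given in the PR description.
TOKEN_BUDGET_PER_RANK = 249_472
DP_RANKS = 8

def max_batch(tokens_per_request: int) -> int:
    # Each rank holds budget // tokens_per_request requests; sum over ranks.
    return (TOKEN_BUDGET_PER_RANK // tokens_per_request) * DP_RANKS

native_batch = max_batch(16_384)       # full 16K KV cache per request
hierarchical_batch = max_batch(2_048)  # fixed 2,048-token GPU window
```

The native figure comes out at 120, and the hierarchical figure near 970 before accounting for overheads, consistent with the ~900 reported above.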
  2. Below is an analysis of GPU memory usage in a long-text generation scenario with 4K input tokens, 16 output tokens, and 12 concurrent requests. As shown, the hierarchical DSA maintains GPU memory utilization at only around 10%, whereas the normal mode quickly exhausts nearly all available GPU memory.
[image]

Checklist

hzh0425 and others added 26 commits November 20, 2025 13:44
- Introduced `enable_hierarchical_nsa` flag in `ServerArgs` for enabling hierarchical NSA.
- Added `NSAHybridTokenToKVPoolAllocator` for managing separate allocations for KV cache and indexer_k.
- Updated memory management and allocation functions to handle hierarchical NSA.
- Modified relevant classes and methods to support new indexer_k functionality, including `NSAReqToTokenPool` and changes in `ScheduleBatch`.
- Adjusted allocation logic in `alloc_for_extend` and `alloc_for_decode` to accommodate indexer_k handling.
- Enhanced memory checking in `SchedulerRuntimeCheckerMixin` for hierarchical NSA.
- Updated documentation and comments for clarity on new features and changes.
Optimized the sparse diff Triton kernel
…9988

Sparse framework 1201 huangtingwei9988

wqlxx commented Dec 15, 2025

May I ask when PD separation will be supported? I can't wait~~~ @hzh0425

Collaborator Author

hzh0425 commented Dec 15, 2025

May I ask when PD separation will be supported? I can't wait~~~ @hzh0425

Will support it within one week.

hzh0425 and others added 3 commits January 3, 2026 16:08
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>

(cherry picked from commit a89e85e)
Contributor

yuyu5333 commented Jan 6, 2026

In python/sglang/srt/layers/attention/nsa_backend.py:

class NativeSparseAttnBackend(...):
    # ......

    def init_forward_metadata_capture_cuda_graph(self):
        # ......
        if self.enable_nsa_hybrid_indexer_pool:
            indexer_page_table_1 = self.decode_cuda_graph_metadata[
                "indexer_page_table"
            ][:bs, :]
            indexer_real_page_table = self._transform_table_1_to_real(
                indexer_page_table_1
            )
        else:
            indexer_real_page_table = None

I think indexer_page_table_1 should consider the case where MTP is enabled: the shape of indexer_page_table_1 needs to be bs * speculative_num_draft_tokens to avoid a dimension-check failure in deep_gemm.fp8_paged_mqa_logits.

If allowed, I can submit a PR to fix that. I also found that the MTP draft-decode path is incomplete; I can complete it as well.
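The shape concern can be illustrated with a hypothetical helper (names are illustrative, not the actual SGLang API): with MTP enabled, each request contributes speculative_num_draft_tokens rows to the indexer page table instead of one.

```python
def indexer_table_rows(bs: int, speculative_num_draft_tokens: int,
                       mtp_enabled: bool) -> int:
    # Hypothetical helper: with MTP enabled, each of the bs requests
    # carries speculative_num_draft_tokens draft positions, so the page
    # table passed to the paged MQA logits kernel needs that many rows.
    return bs * speculative_num_draft_tokens if mtp_enabled else bs

rows_no_mtp = indexer_table_rows(bs=16, speculative_num_draft_tokens=2,
                                 mtp_enabled=False)
rows_mtp = indexer_table_rows(bs=16, speculative_num_draft_tokens=2,
                              mtp_enabled=True)
```

Slicing the captured table with only `[:bs, :]` would therefore under-allocate rows in the MTP case, which is the dimension mismatch described above.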

Collaborator Author

hzh0425 commented Jan 6, 2026

In python/sglang/srt/layers/attention/nsa_backend.py:

class NativeSparseAttnBackend(...):

Thanks for the report! @yuyu5333
#15807

We had previously discussed this issue. Perhaps you could submit a PR to help fix it after this PR is merged.

@mikezou2026-zen

Hi, may I ask whether there is a plan to support DS3.2 L2 HiCache in PD unified mode?

Collaborator Author

hzh0425 commented Jan 7, 2026

Hi, may I ask whether there is a plan to support DS3.2 L2 HiCache in PD unified mode?

Yes, we are working on this feature and expect to release it as soon as possible.

@yuyu5333
Contributor

Thank you very much for the authors' efforts. During my attempts and tests, I found that when GPU memory is insufficient (e.g., H20 with 96 GiB), the maximum number of decode requests is still limited to a very low range.
Therefore, I made some modifications to the HiCache KV offload, which significantly improved overall throughput. The cost is a slight increase in TPOT, but there is no impact on accuracy.

Would you be interested in discussing this? It might bring further improvements.

This is the decode log information:

# before
[2026-01-13 09:58:14 DP5 TP5 EP5] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.18, #queue-req: 0,
[2026-01-13 09:58:15 DP2 TP2 EP2] Decode batch, #running-req: 4, #token: 131776, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.30, #queue-req: 0,
[2026-01-13 09:58:16 DP4 TP4 EP4] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 27, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.19, #queue-req: 0,
[2026-01-13 09:58:16 DP3 TP3 EP3] Decode batch, #running-req: 4, #token: 132416, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 29, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.44, #queue-req: 0,
[2026-01-13 09:58:16 DP1 TP1 EP1] Decode batch, #running-req: 4, #token: 132032, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.14, #queue-req: 0,
[2026-01-13 09:58:17 DP0 TP0 EP0] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.79, #queue-req: 0,
[2026-01-13 09:58:17 DP6 TP6 EP6] Decode batch, #running-req: 4, #token: 132224, token usage: 0.96, accept len: 1.98, accept rate: 0.99, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 114.34, #queue-req: 0,
[2026-01-13 09:58:17 DP7 TP7 EP7] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.89, #queue-req: 0,


# after
[2026-01-13 12:32:06 DP7 TP7 EP7] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.90, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.53, #queue-req: 0,
[2026-01-13 12:32:07 DP0 TP0 EP0] Decode batch, #running-req: 30, #token: 100288, token usage: 0.73, accept len: 1.71, accept rate: 0.86, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 375.49, #queue-req: 0,
[2026-01-13 12:32:07 DP4 TP4 EP4] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.89, accept rate: 0.94, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 416.05, #queue-req: 0,
[2026-01-13 12:32:07 DP1 TP1 EP1] Decode batch, #running-req: 27, #token: 99968, token usage: 0.73, accept len: 1.86, accept rate: 0.93, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 389.93, #queue-req: 0,
[2026-01-13 12:32:07 DP3 TP3 EP3] Decode batch, #running-req: 29, #token: 100160, token usage: 0.73, accept len: 1.83, accept rate: 0.92, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.83, #queue-req: 0,
[2026-01-13 12:32:10 DP5 TP5 EP5] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.89, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 393.26, #queue-req: 0,
[2026-01-13 12:32:10 DP2 TP2 EP2] Decode batch, #running-req: 29, #token: 100224, token usage: 0.73, accept len: 1.90, accept rate: 0.95, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 417.73, #queue-req: 0,
[2026-01-13 12:32:11 DP6 TP6 EP6] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.62, accept rate: 0.81, pre-allocated usage: 0.48, #prealloc-req: 0, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 339.64, #queue-req: 0,

Collaborator Author

hzh0425 commented Jan 14, 2026

Thank you very much for the authors' efforts. During my attempts and tests, I found that when the GPU memory is insufficient (such as H20-96GiB), the maximum number of decode requests is still limited to a very low range.
Therefore, I made some modifications to Hicache kv offload […]

Thanks! How can I contact you on Slack? @yuyu5333
Can you ping me ('Zhangheng Huang')?


Labels

deepseek hicache Hierarchical Caching for SGLang sgl-kernel


9 participants