
[Sparse & HICache]: Enables hierarchical sparse KV cache management and scheduling for DeepSeek V3.2 #14619

Open
hzh0425 wants to merge 41 commits into main from hicache/sparse

Conversation

Collaborator

@hzh0425 hzh0425 commented Dec 8, 2025

Motivation

Previously, Sparse Attention was introduced for DeepSeek V3.2, which selects only the top-2048 tokens to participate in the attention computation, significantly improving inference performance in long-context scenarios.

However, the model still requires caching the full KV cache in GPU memory, so the storage footprint of the KV cache is not reduced. This limits the number of concurrent long-context requests the system can handle.

Therefore, we aim to integrate HICache to implement a hierarchical DSA: the full KV cache is stored on the CPU or in a remote store, and only 2,048 tokens per request are kept in GPU memory. This significantly increases the batch size during the decode phase and improves overall decode throughput.

Co-authors: @huangtingwei9988 @xiezhq-hermann @LingYeAI and the HICache team
Upstream issue: #12826
This is the upstream branch; we will split it into multiple smaller PRs to move forward.
Todo PR:

Modifications

  1. We have integrated HICache to build a unified hierarchical sparse framework that supports various sparse attention algorithms—such as Quest and ClusterKV—for hierarchical, sparsity-aware KV cache management.

  2. Implementation of Hierarchical DSA:

    • Decoupled the DSA indexer from KV cache allocation and the req_to_token_pool mechanism: the full indexer remains on the GPU.
    • During the decode phase, each request is allocated a fixed token space of 2,048 tokens in GPU memory.
    • Before each attention layer begins, a diff kernel identifies the missing KV cache entries in GPU memory, and HICache incrementally fetches only those missing portions from CPU memory. Our observations show that, on average, only about 20% of the KV cache needs to be fetched incrementally per layer.
[image]
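The diff step above can be sketched as a set difference. This is a minimal NumPy sketch of the idea only, not the actual Triton kernel from this PR (which operates on page tables, not raw id arrays); all names here are illustrative.

```python
import numpy as np

def diff_missing_tokens(selected_ids: np.ndarray, resident_ids: np.ndarray) -> np.ndarray:
    """Return the token ids selected by the indexer that are not yet resident on GPU.

    Hypothetical sketch: np.setdiff1d yields the sorted unique values in
    selected_ids that are absent from resident_ids -- exactly the entries
    HICache must fetch incrementally from host memory before the layer runs.
    """
    return np.setdiff1d(selected_ids, resident_ids)

# Example: the indexer picks 6 tokens, 4 of which are already on GPU,
# so only the remaining 2 need an incremental fetch.
selected = np.array([3, 7, 8, 15, 21, 42])
resident = np.array([3, 7, 15, 42])
missing = diff_missing_tokens(selected, resident)
```

In practice the fraction of missing entries is what drives the I/O cost; the ~20% figure reported above is the average of this ratio per layer.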

Accuracy Tests

We evaluated several LongBench tests, and the accuracy is nearly identical to that of the original DSA version.
[image]

Benchmarking and Profiling

  1. Comparison Setup:

    • Model: dsv32 on H20 (140 GB)
    • Configuration: 8 DP, radix tree and overlap scheduling disabled; average sequence length: 16K
    • Native Mode: Maximum supported batch size ≈ (249,472 / 16,384) × 8 ≈ 120
    • Hierarchical Mode: Maximum supported batch size ≈ (249,472 / 2,048) × 8 ≈ 900; Host memory usage per rank: ~72 GB
    • Decode throughput comparison is as follows. (The current performance is not yet optimal; we will continue refining and optimizing, using computation-communication overlap to hide the latency introduced by the additional I/O.)
[image]
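The batch-size estimates above can be reproduced with a few lines of arithmetic, assuming the per-rank KV-cache budget of 249,472 tokens and 8 DP ranks stated in the setup:

```python
# Rough capacity arithmetic for the comparison setup above.
# Assumes a per-rank KV-cache token budget of 249,472 and 8 DP ranks,
# as given in the PR description.
TOKEN_BUDGET_PER_RANK = 249_472
DP_RANKS = 8

def max_batch(tokens_per_request: int) -> int:
    # Each rank holds budget // tokens_per_request requests; sum over ranks.
    return (TOKEN_BUDGET_PER_RANK // tokens_per_request) * DP_RANKS

native_batch = max_batch(16_384)       # full 16K KV cache per request
hierarchical_batch = max_batch(2_048)  # fixed 2,048-token GPU window
```

The native figure comes out at 120, and the hierarchical figure near 970 before accounting for overheads, consistent with the ~900 reported above.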
  2. Below is an analysis of GPU memory usage in a long-text generation scenario with 4K input tokens, 16 output tokens, and 12 concurrent requests. As shown, the hierarchical DSA maintains GPU memory utilization at only around 10%, whereas the normal mode quickly exhausts nearly all available GPU memory.
[image]

Checklist

hzh0425 and others added 26 commits November 20, 2025 13:44
- Introduced `enable_hierarchical_nsa` flag in `ServerArgs` for enabling hierarchical NSA.
- Added `NSAHybridTokenToKVPoolAllocator` for managing separate allocations for KV cache and indexer_k.
- Updated memory management and allocation functions to handle hierarchical NSA.
- Modified relevant classes and methods to support new indexer_k functionality, including `NSAReqToTokenPool` and changes in `ScheduleBatch`.
- Adjusted allocation logic in `alloc_for_extend` and `alloc_for_decode` to accommodate indexer_k handling.
- Enhanced memory checking in `SchedulerRuntimeCheckerMixin` for hierarchical NSA.
- Updated documentation and comments for clarity on new features and changes.
Optimized the sparse diff Triton kernel
…9988

Sparse framework 1201 huangtingwei9988

wqlxx commented Dec 15, 2025

May I ask when PD separation will be supported? I can't wait~~~ @hzh0425

Collaborator Author

hzh0425 commented Dec 15, 2025

May I ask when PD separation will be supported? I can't wait~~~ @hzh0425

Will support it within one week.

hzh0425 and others added 3 commits January 3, 2026 16:08
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>

(cherry picked from commit a89e85e)
Contributor

yuyu5333 commented Jan 6, 2026

In python/sglang/srt/layers/attention/nsa_backend.py:

class NativeSparseAttnBackend(...):
    # ......

    def init_forward_metadata_capture_cuda_graph(self):
        # ......
        if self.enable_nsa_hybrid_indexer_pool:
            indexer_page_table_1 = self.decode_cuda_graph_metadata[
                "indexer_page_table"
            ][:bs, :]
            indexer_real_page_table = self._transform_table_1_to_real(
                indexer_page_table_1
            )
        else:
            indexer_real_page_table = None

I think indexer_page_table_1 should consider the case where MTP is enabled: the shape of indexer_page_table_1 needs to be bs * speculative_num_draft_tokens to avoid a dimension-check failure in deep_gemm.fp8_paged_mqa_logits.

If allowed, I can submit a PR to fix that. I also found that the MTP draft-decode path is incomplete; I can complete it as well.
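The shape concern can be illustrated with a hypothetical helper (names are illustrative, not the actual SGLang API): with MTP enabled, each request contributes speculative_num_draft_tokens rows to the indexer page table instead of one.

```python
def indexer_table_rows(bs: int, speculative_num_draft_tokens: int,
                       mtp_enabled: bool) -> int:
    # Hypothetical helper: with MTP enabled, each of the bs requests
    # carries speculative_num_draft_tokens draft positions, so the page
    # table passed to the paged MQA logits kernel needs that many rows.
    return bs * speculative_num_draft_tokens if mtp_enabled else bs

rows_no_mtp = indexer_table_rows(bs=16, speculative_num_draft_tokens=2,
                                 mtp_enabled=False)
rows_mtp = indexer_table_rows(bs=16, speculative_num_draft_tokens=2,
                              mtp_enabled=True)
```

Slicing the captured table with only `[:bs, :]` would therefore under-allocate rows in the MTP case, which is the dimension mismatch described above.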

Collaborator Author

hzh0425 commented Jan 6, 2026

In python/sglang/srt/layers/attention/nsa_backend.py:

class NativeSparseAttnBackend(...):

Thanks for the report! @yuyu5333
#15807

We had previously discussed this issue. Perhaps you could submit a PR to help fix it after this PR is merged.

@mikezou2026-zen

Hi, may I ask whether there is a plan to support DS3.2 L2 HiCache in PD unified mode?

Collaborator Author

hzh0425 commented Jan 7, 2026

Hi, may I ask whether there is a plan to support DS3.2 L2 HiCache in PD unified mode?

Yes, we are working on this feature and expect to release it as soon as possible.

@yuyu5333
Contributor

Thank you very much for the authors' efforts. During my attempts and tests, I found that when GPU memory is insufficient (e.g., H20 with 96 GiB), the maximum number of decode requests is still limited to a very low range.
Therefore, I made some modifications to the HiCache KV offload, which significantly improved overall throughput. The cost is a slight increase in TPOT, but there is no impact on accuracy.

Would you be interested in discussing this? It might bring further improvements.

This is the decode log information:

# before
[2026-01-13 09:58:14 DP5 TP5 EP5] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.18, #queue-req: 0,
[2026-01-13 09:58:15 DP2 TP2 EP2] Decode batch, #running-req: 4, #token: 131776, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.30, #queue-req: 0,
[2026-01-13 09:58:16 DP4 TP4 EP4] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 27, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.19, #queue-req: 0,
[2026-01-13 09:58:16 DP3 TP3 EP3] Decode batch, #running-req: 4, #token: 132416, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 29, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.44, #queue-req: 0,
[2026-01-13 09:58:16 DP1 TP1 EP1] Decode batch, #running-req: 4, #token: 132032, token usage: 0.96, accept len: 1.99, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.14, #queue-req: 0,
[2026-01-13 09:58:17 DP0 TP0 EP0] Decode batch, #running-req: 4, #token: 132096, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.79, #queue-req: 0,
[2026-01-13 09:58:17 DP6 TP6 EP6] Decode batch, #running-req: 4, #token: 132224, token usage: 0.96, accept len: 1.98, accept rate: 0.99, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 114.34, #queue-req: 0,
[2026-01-13 09:58:17 DP7 TP7 EP7] Decode batch, #running-req: 4, #token: 132288, token usage: 0.96, accept len: 2.00, accept rate: 1.00, pre-allocated usage: 0.00, #prealloc-req: 28, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 115.89, #queue-req: 0,


# after
[2026-01-13 12:32:06 DP7 TP7 EP7] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.90, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.53, #queue-req: 0,
[2026-01-13 12:32:07 DP0 TP0 EP0] Decode batch, #running-req: 30, #token: 100288, token usage: 0.73, accept len: 1.71, accept rate: 0.86, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 375.49, #queue-req: 0,
[2026-01-13 12:32:07 DP4 TP4 EP4] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.89, accept rate: 0.94, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 416.05, #queue-req: 0,
[2026-01-13 12:32:07 DP1 TP1 EP1] Decode batch, #running-req: 27, #token: 99968, token usage: 0.73, accept len: 1.86, accept rate: 0.93, pre-allocated usage: 0.48, #prealloc-req: 2, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 389.93, #queue-req: 0,
[2026-01-13 12:32:07 DP3 TP3 EP3] Decode batch, #running-req: 29, #token: 100160, token usage: 0.73, accept len: 1.83, accept rate: 0.92, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 384.83, #queue-req: 0,
[2026-01-13 12:32:10 DP5 TP5 EP5] Decode batch, #running-req: 30, #token: 100160, token usage: 0.73, accept len: 1.79, accept rate: 0.89, pre-allocated usage: 0.48, #prealloc-req: 1, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 393.26, #queue-req: 0,
[2026-01-13 12:32:10 DP2 TP2 EP2] Decode batch, #running-req: 29, #token: 100224, token usage: 0.73, accept len: 1.90, accept rate: 0.95, pre-allocated usage: 0.71, #prealloc-req: 0, #transfer-req: 3, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 417.73, #queue-req: 0,
[2026-01-13 12:32:11 DP6 TP6 EP6] Decode batch, #running-req: 28, #token: 100032, token usage: 0.73, accept len: 1.62, accept rate: 0.81, pre-allocated usage: 0.48, #prealloc-req: 0, #transfer-req: 2, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 339.64, #queue-req: 0,

Collaborator Author

hzh0425 commented Jan 14, 2026

Thank you very much for the authors' efforts. During my attempts and tests, I found that when the GPU memory is insufficient (such as H20-96GiB), the maximum number of decode requests is still limited to a very low range.
Therefore, I made some modifications to Hicache kv offload […]

Thanks! How can I contact you on Slack? @yuyu5333
Can you ping me ('Zhangheng Huang')?


Labels

deepseek hicache Hierarchical Caching for SGLang sgl-kernel


9 participants