[1/N][Sparse With Hicache]: Add Sparse Interface #14741

Merged
xiezhq-hermann merged 7 commits into sgl-project:main from hzh0425:sparse/algorithm_interface1 on Dec 25, 2025

@hzh0425
Collaborator

@hzh0425 hzh0425 commented Dec 9, 2025

Motivation

This PR primarily introduces a generic Retrievable Sparse Algorithm interface.
Upstream PR: #14619

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @hzh0425, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a foundational interface for integrating diverse sparse attention algorithms within the system. By introducing a BaseSparseAlgorithm and SparseMode enum, it provides a standardized way to implement and manage KV cache compression techniques. The PR also includes two initial algorithm implementations, DeepSeekNSAAlgorithm and KnormPageAlgorithm, demonstrating the extensibility of this new framework. This change is crucial for enabling more efficient memory usage and potentially faster inference by selectively attending to important KV cache entries.

Highlights

  • Introduced BaseSparseAlgorithm: A new abstract base class for defining various retrievable KV cache compression algorithms, supporting both page-wise and token-wise sparsity.
  • Defined SparseMode Enum: An enumeration to categorize sparse attention granularity modes, including PAGE_WISE, TOKEN_WISE, and DEEPSEEK_TOKEN_WISE.
  • Added DeepSeekNSAAlgorithm: A concrete implementation for handling DeepSeek NSA specific sparse attention logic.
  • Implemented KnormPageAlgorithm: An experimental page-wise sparse attention algorithm utilizing ChunkKV-style scoring based on key L2 norms.
  • Included Unit Tests: Comprehensive unit tests for the KnormPageAlgorithm to ensure its correctness and functionality.
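
As a rough illustration of the pieces listed above, the following is a minimal sketch and not the PR's actual code. The three `SparseMode` member names come from the summary; everything else (the `retrieve_topk` signature, the `ToyKnormPage` class, the `page_scores` helper, and the choice to rank pages with a higher mean key norm first) is a hypothetical stand-in, using plain Python lists instead of tensors.

```python
import math
from abc import ABC, abstractmethod
from enum import Enum, auto


class SparseMode(Enum):
    """Granularity modes named in the summary."""
    PAGE_WISE = auto()
    TOKEN_WISE = auto()
    DEEPSEEK_TOKEN_WISE = auto()


class BaseSparseAlgorithm(ABC):
    """Hypothetical minimal interface; the real class likely has more methods."""

    mode: SparseMode

    @abstractmethod
    def retrieve_topk(self, keys, k):
        """Return the indices of the k pages to keep, in ascending order."""


class ToyKnormPage(BaseSparseAlgorithm):
    """Toy page-wise scorer: a page's score is the mean L2 norm of its keys."""

    mode = SparseMode.PAGE_WISE

    def __init__(self, page_size):
        self.page_size = page_size

    def page_scores(self, keys):
        # One L2 norm per token key, then one mean per page of tokens.
        norms = [math.sqrt(sum(x * x for x in key)) for key in keys]
        pages = [
            norms[i:i + self.page_size]
            for i in range(0, len(norms), self.page_size)
        ]
        return [sum(page) / len(page) for page in pages]

    def retrieve_topk(self, keys, k):
        scores = self.page_scores(keys)
        # Assumption: a higher mean norm ranks first; the real rule may differ.
        ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
        return sorted(ranked[:k])
```

With `page_size=2`, six key vectors form three pages, and `retrieve_topk(keys, 2)` returns the indices of the two highest-scoring pages.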

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable abstraction for sparse attention algorithms by adding a generic BaseSparseAlgorithm interface. The changes include the base class, two initial implementations (DeepSeekNSAAlgorithm and KnormPageAlgorithm), and unit tests for one of them. My review identifies a critical issue in DeepSeekNSAAlgorithm that would lead to a runtime error due to an incorrect return value. I've also pointed out several medium-severity issues in KnormPageAlgorithm related to hardcoded values, dead code, incorrect type hints, and verbose logging that could impact maintainability and performance. Overall, this is a solid architectural improvement, and addressing these points will enhance the robustness and quality of the new sparse attention framework.

@hzh0425 hzh0425 added the run-ci label Dec 9, 2025
@hzh0425 hzh0425 force-pushed the sparse/algorithm_interface1 branch from 199f115 to 9c9d83a Compare December 9, 2025 13:45
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
@xiezhq-hermann xiezhq-hermann self-assigned this Dec 10, 2025
@hzh0425 hzh0425 marked this pull request as ready for review December 11, 2025 07:47
@hzh0425 hzh0425 force-pushed the sparse/algorithm_interface1 branch from 37c9b5c to 4d1a182 Compare December 11, 2025 08:45
@hzh0425 hzh0425 force-pushed the sparse/algorithm_interface1 branch from a07afd8 to 205c2bc Compare December 17, 2025 10:47
@hzh0425 hzh0425 added hicache (Hierarchical Caching for SGLang) and removed deepseek sgl-kernel labels Dec 17, 2025
@hzh0425
Collaborator Author

hzh0425 commented Dec 20, 2025

We refactored the code to remove the distinction between PageWise and TokenWise, since TokenWise is essentially a special case of PageWise with PageSize = 1.
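
To make the PageSize = 1 point concrete, here is a small sketch (not taken from the PR) of a single selection routine that covers both granularities. The function name `select_tokens` and the rule that a page's score is the maximum of its token scores are assumptions chosen only for illustration.

```python
def select_tokens(token_scores, page_size, num_pages):
    """Pick the top `num_pages` pages and return the token indices they cover."""
    # Split the per-token scores into consecutive pages.
    pages = [
        token_scores[i:i + page_size]
        for i in range(0, len(token_scores), page_size)
    ]
    # Assumed scoring rule: a page is as important as its best token.
    page_scores = [max(page) for page in pages]
    top_pages = sorted(
        range(len(pages)), key=lambda p: page_scores[p], reverse=True
    )[:num_pages]

    # Expand the chosen pages back into token indices, in ascending order.
    tokens = []
    for p in sorted(top_pages):
        start = p * page_size
        tokens.extend(range(start, min(start + page_size, len(token_scores))))
    return tokens
```

Calling it with `page_size=1` makes every page a single token, so the same code path performs token-wise selection without a separate mode.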

In base_algorithm.py, I implemented a BaseSparseAlgorithmImpl as a common base class. Subclasses can integrate into the pipeline by implementing a few key methods, and they also have the flexibility to override critical methods such as retrieve_topk and update_representations to customize their specific logic.
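
The pattern described here can be sketched roughly as follows. Only the names `BaseSparseAlgorithmImpl`, `retrieve_topk`, and `update_representations` appear in this thread; the hook methods (`compute_page_representation`, `score`, `rank_pages`), the two subclasses, and all signatures are hypothetical, and plain lists stand in for tensors.

```python
class BaseSparseAlgorithmImpl:
    """Common base: default pipeline methods built on two small hooks."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.representations = {}  # page index -> page representation

    # Hooks a subclass is expected to implement (hypothetical names).
    def compute_page_representation(self, page_keys):
        raise NotImplementedError

    def score(self, query, representation):
        raise NotImplementedError

    # Default pipeline methods; subclasses may override these as well.
    def update_representations(self, keys):
        for start in range(0, len(keys), self.page_size):
            page = keys[start:start + self.page_size]
            page_idx = start // self.page_size
            self.representations[page_idx] = self.compute_page_representation(page)

    def rank_pages(self, query):
        # Page indices ordered from highest to lowest score.
        return sorted(
            self.representations,
            key=lambda p: self.score(query, self.representations[p]),
            reverse=True,
        )

    def retrieve_topk(self, query, k):
        return sorted(self.rank_pages(query)[:k])


class MeanKeyAlgorithm(BaseSparseAlgorithmImpl):
    """Implements only the hooks: mean key per page, dot-product score."""

    def compute_page_representation(self, page_keys):
        return [sum(col) / len(page_keys) for col in zip(*page_keys)]

    def score(self, query, representation):
        return sum(q * r for q, r in zip(query, representation))


class KeepFirstPageAlgorithm(MeanKeyAlgorithm):
    """Overrides retrieve_topk so that page 0 is always retained."""

    def retrieve_topk(self, query, k):
        if k <= 0:
            return []
        if 0 not in self.representations:
            return super().retrieve_topk(query, k)
        others = [p for p in self.rank_pages(query) if p != 0]
        return sorted([0] + others[:k - 1])
```

`MeanKeyAlgorithm` shows the "implement a few key methods" route, while `KeepFirstPageAlgorithm` shows the override route by replacing the default selection logic.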

Additionally, we plan to improve the performance of BaseSparseAlgorithmImpl in the next PR.
This PR focuses on defining the foundational interfaces and implementing a reasonable sparse algorithm.

Could you please review it again when you have time? @xiezhq-hermann

Accuracy test for the Quest sparse algorithm (implemented by @magicYang1573):

GSM8K

Llama 8B test (the original model's accuracy (ACC) is 0.94):

Compression rate | 0.3  | 0.5  | 0.7
Accuracy (ACC)   | 0.92 | 0.92 | 0.92

aime25

Qwen32B Test:

Original Mode:
image

Quest Sparse Mode with 0.5 compression rate:
image

@xiezhq-hermann xiezhq-hermann added the ready-to-merge The PR is ready to merge after the CI is green. label Dec 23, 2025
@hzh0425
Collaborator Author

hzh0425 commented Dec 23, 2025

/rerun-failed-ci

@xiezhq-hermann xiezhq-hermann merged commit a89e85e into sgl-project:main Dec 25, 2025
177 of 189 checks passed
Leoyzen pushed a commit to Leoyzen/sglang that referenced this pull request Dec 25, 2025
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>
hzh0425 added a commit that referenced this pull request Jan 6, 2026
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>

(cherry picked from commit a89e85e)
yuyu5333 pushed a commit to yuyu5333/sglang-bytedance that referenced this pull request Jan 7, 2026
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>

(cherry picked from commit a89e85e)
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: MagicYang1573 <1328657938@qq.com>

Labels

  • deepseek
  • hicache: Hierarchical Caching for SGLang
  • ready-to-merge: The PR is ready to merge after the CI is green.
  • run-ci

4 participants