[Feature] Support Efficient Sparse HiP Attention (InfiniteHiP) with Long-Context Generalization and KV Offloading Capabilities #3930
Conversation
@daniel-geon-park Sorry for the long wait. Could you rebase this? We've reminded the reviewer and will review and merge it soon.
(force-pushed 0d0de6c to 1f0c77f)
Thanks! I have squashed and rebased the commits.
Thanks! I will ask hai for help.
@daniel-geon-park hello, could you rebase with the main branch? I asked hai for help again.
(force-pushed a31aa81 to fdee089)
I rebased with the main branch again.
(force-pushed 0fb9859 to ee467de)
@daniel-geon-park Sorry for the delay, I am prioritizing this on top of everything else.
@daniel-geon-park For the hip-attention library and kernels, do they work with both Nvidia and AMD GPUs?
It only works for Nvidia GPUs at the moment.
@daniel-geon-park Is this feature compatible with chunked prefill?
Yes, indeed it is!
Thanks for the contribution!
General comments:
- Too many interface- and class-level changes; consider some code refactoring to minimize them and avoid the performance impact.
- Need to explore options for supporting AMD/ROCm, or add log/assert messages for these args options on ROCm.
- Are only the 3 models involved? What happens when run with DeepSeekV3/R1, etc.? Does it error out, or is it unsupported? These cases need to be handled.
- Could you provide example runs with commands and results?
- Where is the HiP package imported from? Should pyproject be updated?
Do we support rope scaling here?
The update_context_length function changes the context length in the model config so that the cached positional embeddings are long enough to handle the extended inputs.
On the other hand, rope scaling keeps the original context length in a separate variable called config.rope_scaling.original_max_position_embeddings, which is not modified here. So this should not interfere with the rope scaling.
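To make the interaction concrete, here is a minimal sketch of the behavior described above. The class and function names are illustrative stand-ins, not SGLang's actual implementation; only the two config fields (`max_position_embeddings` and `rope_scaling.original_max_position_embeddings`) mirror the real Hugging Face-style config layout:

```python
# Hypothetical sketch of update_context_length semantics (not SGLang's code):
# the effective context length grows, the rope-scaling anchor stays intact.

class RopeScaling:
    def __init__(self, original_max_position_embeddings):
        # Rope scaling records the *original* training context length here.
        self.original_max_position_embeddings = original_max_position_embeddings

class ModelConfig:
    def __init__(self, max_position_embeddings, rope_scaling):
        self.max_position_embeddings = max_position_embeddings
        self.rope_scaling = rope_scaling

def update_context_length(config, new_len):
    # Enlarge only the effective context length so the cached positional
    # embeddings cover extended inputs; rope scaling metadata is untouched.
    config.max_position_embeddings = max(config.max_position_embeddings, new_len)
    return config

config = ModelConfig(131072, RopeScaling(8192))
update_context_length(config, 3_000_000)
assert config.max_position_embeddings == 3_000_000
assert config.rope_scaling.original_max_position_embeddings == 8192
```

Since only `max_position_embeddings` is enlarged, any rope-scaling computation that reads `original_max_position_embeddings` sees the same value before and after the extension.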
Is it necessary to add RoPE within RadixAttention? Is it generic enough to put here?
The context extension mechanism of HiP Attention requires that RoPE is not pre-applied to the queries and keys; instead, it is applied inside the attention kernel. Therefore, we need a way to pass the cached RoPE to the attention backend. If you have a better idea for this, please let me know.
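The design constraint described above can be sketched with a toy example. All names here are illustrative, not SGLang's or HiP's real API: the point is only that handing the rotary function to the backend (so it rotates q/k internally) produces the same scores as pre-rotating, while leaving the backend free to remap positions for context extension:

```python
import math

def rope_rotate(x, pos, dim=4, base=10000.0):
    # Standard RoPE rotation of a single vector x at position pos.
    out = list(x)
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def attention_backend(q, k, q_pos, k_pos, rope=None):
    # When a rope callable is passed in, apply it *inside* the backend, so the
    # kernel receives un-rotated q/k and could remap positions if it wanted.
    if rope is not None:
        q, k = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(q, k))  # one q.k score, enough for a sketch

q, k = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]
pre = attention_backend(rope_rotate(q, 5), rope_rotate(k, 3), 5, 3)  # usual path
inside = attention_backend(q, k, 5, 3, rope=rope_rotate)             # HiP-style path
assert abs(pre - inside) < 1e-9
```

With identical positions the two paths agree exactly; the kernel-applied variant is what lets HiP substitute remapped positions during extension.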
Same comment on RoPE above
@merrymercy Could you also have a look?
Hi. I am an infra engineer supporting this PR. Currently we do not have AMD hardware to test on in the cloud, but I will look into whether we can get some cloud AMD hardware within the budget, and whether @daniel-geon-park and @gmlwns2000 can work on AMD hardware. However, the only cloud hardware I can find is MI300X on RunPod. Would testing on MI300X be sufficient? I would appreciate it if you know of any other AMD GPU cloud provider.
@kbumsik Thanks for the question. MI300X should be sufficient; I will try to help arrange access if you really can't find it. Ping me in Slack if needed.
We tried to minimize the interface-level changes, but I believe this is the minimal amount of change needed to make our method work in SGLang. If you have suggestions, I'd be happy to incorporate them.
That's a fair point. We added logic to error out on unsupported models. We'll incorporate more popular models such as DeepSeekV3/R1.
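A guard like the one mentioned above could look roughly like this. The allowlist contents and the function name are hypothetical, for illustration only; the PR's actual check may differ:

```python
# Illustrative sketch of an "error out on unsupported models" guard
# (the architecture names and helper below are hypothetical, not SGLang's code).

HIP_SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "MistralForCausalLM",
}

def check_hip_support(architecture: str) -> None:
    # Fail fast at server startup rather than with a confusing runtime error.
    if architecture not in HIP_SUPPORTED_ARCHITECTURES:
        raise ValueError(
            f"--enable-hip-attention does not support {architecture!r} yet; "
            f"supported architectures: {sorted(HIP_SUPPORTED_ARCHITECTURES)}"
        )

check_hip_support("LlamaForCausalLM")  # supported: passes silently
try:
    check_hip_support("DeepseekV3ForCausalLM")
except ValueError as e:
    print("rejected:", e)
```

Failing at argument-parsing time gives the user an actionable message instead of a kernel-level crash mid-request.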
Sure. Please refer to the following page for example commands: https://github.com/DeepAuto-AI/hip-attention/blob/deepauto/dev/docs/USAGE.sglang.md
We are concurrently working on adding the HiP package to the PyPI index, but it has not been published yet. We will update as soon as it is online.
Thank you for answering! Unfortunately, we are unable to get budget for an AMD GPU at the moment.
So could I ask you to arrange access? I am not sure of your nickname on Slack to send a DM, so I am posting here. What is your nickname on Slack? Alternatively, you can find me on the SGLang Slack (kbumsik@gmail.com).
(force-pushed ab3f6f0 to 5444880)
Bumping this PR to move it forward. Can it be merged? If not, what blocks it from being merged?
Nice work! How does it work for multimodal models?
Hi @sleepwalker2017, we tested Qwen2.5 VL and Llama 4 models with HiP Attention. We replace the attention mechanism of the LLM with HiP. I am not sure this PR contains patches for handling vision models, but my working branch (…) does. Here are our draft performance results:
I am not sure the above result is sufficiently optimized, because I simply copied almost the same hyperparameters from the language models with HiP, so the performance will probably improve once we optimize the hyperparameters. To reproduce, this is the command to run SGLang with vision models:
# Llama4 + HiP
SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=0 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 20000 --tp 8 --max-total-tokens 200000 --context-length 200000 \
--cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
--hip-attention-config '{"using_extend": false, "dense_layers": [0, 1, 2, 3]}' \
--disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
--enable-multimodal --chat-template llama-4 --enable-hip-attention
# QwenVL 2.5 + HiP
VIDEO_MAX_PIXELS=$((128000 * 28 * 25)) \
SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=1 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-VL-72B-Instruct-AWQ \
--port 30000 --tp 8 --max-total-tokens 256000 --context-length 256000 \
--cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
--hip-attention-config '{"using_extend": true, "dense_layers": [0, 1, 2, 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 79], "layers": [{"sliding_window_size": "122880", "second_stage_k": 4096, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}, {"sliding_window_size": 4096, "second_stage_k": 2048, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}]}' \
--disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
--enable-multimodal --chat-template qwen2-vl \
--kv-cache-dtype fp8_e5m2 --enable-hip-attention
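Since `--hip-attention-config` takes a raw JSON string, a quick sanity check of the structure before launching can save a failed server start. The snippet below validates a trimmed subset of the Qwen2.5-VL config above; the validator itself is only an illustration, and field semantics are documented in the HiP usage guide linked earlier in the thread:

```python
import json

# Trimmed subset of the --hip-attention-config JSON used in the QwenVL command.
raw = (
    '{"using_extend": true, '
    '"dense_layers": [0, 1, 2, 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 79], '
    '"layers": [{"sliding_window_size": "122880", "second_stage_k": 4096, '
    '"sa_extend_backend": "streaming", "scan_extend_backend": "streaming"}]}'
)
config = json.loads(raw)

# Context extension is enabled, and the first four layers stay dense.
assert config["using_extend"] is True
assert config["dense_layers"][:4] == [0, 1, 2, 3]
# Note: sliding_window_size is passed as a string in the original command,
# so coerce before comparing numerically.
assert int(config["layers"][0]["sliding_window_size"]) == 122880
```

A malformed string here would raise `json.JSONDecodeError` immediately, which is easier to debug than a shell-quoting error surfacing inside the server.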
What do the numbers in the sheet mean? Time cost, throughput, or something else? Thank you!
The table shows the model performance on LongVideoBench (accuracy). We did not measure the latency on that benchmark yet, but our paper might be helpful to understand the latency of our approach. Thanks! |
Motivation
Dear SGLang maintainers,
We implement InfiniteHiP (Extending Language Model Context Up to 3 Million Tokens on a Single GPU), the latest version of HiP Attention, building on our ICLR 2025 paper, "A Training-Free Sub-quadratic Cost Transformer Model Serving Framework with Hierarchically Pruned Attention".
Just by enabling our mechanism with the flag --enable-hip-attention, the model is able to generalize beyond the context length it was trained on and perform efficient sparse attention, without any noticeable degradation of the pretrained model's capabilities.
Furthermore, the switch --enable-hip-kv-cache-offload enables our (experimental) KV offloading mechanism, which alleviates the memory pressure of extremely long contexts and enables serving up to 3 million tokens on a single L40S GPU.
The patches added to the SGLang codebase are minimal, as most of the implementation lives in our library.
Modifications
Checklist