
[Feature] Support Efficient Sparse HiP Attention (InfiniteHiP) with Long-Context Generalization and KV Offloading Capabilities #3930

Open
daniel-geon-park wants to merge 34 commits into sgl-project:main from DeepAuto-AI:deepauto/feat/sglang-pr

Conversation


@daniel-geon-park daniel-geon-park commented Feb 27, 2025

Motivation

Dear SGLang maintainers,

We implement InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU, the latest version of HiP Attention from our latest ICLR 2025 paper, "A Training-Free Sub-quadratic Cost Transformer Model Serving Framework with Hierarchically Pruned Attention".

Simply by enabling our mechanism with the flag --enable-hip-attention, the model can generalize beyond the context length it was trained on and perform efficient sparse attention, without any noticeable degradation of the pretrained model's capabilities.

Furthermore, the switch --enable-hip-kv-cache-offload activates our (experimental) KV offloading mechanism, which alleviates the memory pressure of extremely long contexts and enables serving up to 3 million tokens on a single L40S GPU.
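As a hedged illustration of combining the two switches (the model path and context length below are placeholders, not values from this PR; only the two --enable-hip-* flags are the switches introduced here):

```shell
# Hypothetical minimal launch; model path and --context-length are
# placeholders. The two --enable-hip-* flags are this PR's switches.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --context-length 3000000 \
  --enable-hip-attention \
  --enable-hip-kv-cache-offload
```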

The patches added to the SGLang codebase are minimal, as most of the implementation lives in our library.

Modifications

  • Add a switch for HiP Attention
  • Add a custom backend (hip_radix_attention.py) for HiP Attention
  • Add a custom memory pool for KV offloading
  • Adjust the model code (llama, qwen, exaone for now) for RoPE adjustment for HiP Attention

Checklist

@zhaochenyang20
Collaborator

@daniel-geon-park Sorry for the long wait. Could you rebase this? We've reminded the reviewer and will review and merge it soon.

@HaiShaw HaiShaw added the wip label Mar 4, 2025
@daniel-geon-park daniel-geon-park force-pushed the deepauto/feat/sglang-pr branch 2 times, most recently from 0d0de6c to 1f0c77f on March 4, 2025 18:24
@daniel-geon-park
Author

@daniel-geon-park Sorry for the long wait. Could you rebase this? We've reminded the reviewer and will review and merge it soon.

Thanks! I have squashed and rebased the commits.

@zhaochenyang20
Collaborator

Thanks! I will ask Hai for help.

@zhaochenyang20
Collaborator

@daniel-geon-park Hello, could you rebase onto the main branch? I asked Hai for help again.

@daniel-geon-park daniel-geon-park force-pushed the deepauto/feat/sglang-pr branch from a31aa81 to fdee089 on March 10, 2025 05:46
@daniel-geon-park
Author

@daniel-geon-park Hello, could you rebase onto the main branch? I asked Hai for help again.

I rebased with the main branch again.

@daniel-geon-park daniel-geon-park force-pushed the deepauto/feat/sglang-pr branch from 0fb9859 to ee467de on March 10, 2025 05:50
@HaiShaw
Collaborator

HaiShaw commented Mar 12, 2025

@daniel-geon-park sorry for the delay, I am prioritizing this over everything else.
Thanks for the patience!

@HaiShaw
Collaborator

HaiShaw commented Mar 12, 2025

@daniel-geon-park for the hip-attention library and kernels, do they work with both Nvidia and AMD GPUs?

@daniel-geon-park
Author

@daniel-geon-park for the hip-attention library and kernels, do they work with both Nvidia and AMD GPUs?

It only works for Nvidia GPUs at the moment.

@HaiShaw
Collaborator

HaiShaw commented Mar 14, 2025

@daniel-geon-park is this feature compatible with chunked prefill?

@daniel-geon-park
Author

@daniel-geon-park is this feature compatible with chunked prefill?

Yes indeed it is!

Collaborator

@HaiShaw HaiShaw left a comment


Thanks for the contribution!
General comments:

  • Too many interface- and class-level changes; look into refactoring to minimize them and avoid performance impact.
  • Need to explore options to have it on AMD/ROCm, or give log/assert info from the args options on ROCm.
  • Just the 3 models involved?
  • What if it is run with DeepSeekV3/R1, etc.: error out, or not supported? These cases need to be handled.
  • Could you provide example runs with commands and results?
  • Where is the HiP package to be imported from? Should pyproject be updated?

Collaborator


Do we support rope scaling here?

Author


The update_context_length function changes the context length in the model config so that the cached positional embeddings are long enough to handle the extended inputs.

On the other hand, rope scaling keeps the original context length in a separate variable called config.rope_scaling.original_max_position_embeddings, which is not modified here. So this should not interfere with the rope scaling.
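To make the distinction concrete, here is a tiny self-contained sketch of the described behavior; FakeConfig and this update_context_length body are illustrative stand-ins, not the PR's actual code:

```python
# Illustrative stand-ins, not the PR's actual code: extend the cached
# positional-embedding length without touching the rope-scaling reference.
class FakeConfig:
    def __init__(self):
        self.max_position_embeddings = 131072
        self.rope_scaling = {"original_max_position_embeddings": 8192}

def update_context_length(config, new_len):
    # Grow only the cached embedding length; the rope-scaling reference
    # point (original_max_position_embeddings) is left untouched.
    config.max_position_embeddings = max(config.max_position_embeddings, new_len)

cfg = FakeConfig()
update_context_length(cfg, 3_000_000)
assert cfg.max_position_embeddings == 3_000_000
assert cfg.rope_scaling["original_max_position_embeddings"] == 8192
```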

Collaborator


Necessary to add RoPE within RadixAttention? Generic enough to put here?

Author


The context extension mechanism of HiP Attention requires that RoPE is not pre-applied to the queries and keys; instead, it is applied inside the attention kernel. Therefore, we need to somehow pass the cached RoPE to the attention backend. If you have a better idea for this, please let me know.
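A pure-Python sketch of why rotation can be deferred into the kernel: the attention score of RoPE-rotated vectors depends only on the relative position, so the kernel can rotate cached, un-rotated q/k on the fly. The function names below are illustrative, not the PR's API:

```python
import math

# Rotate consecutive (x, y) pairs of `vec` by position-dependent angles,
# as RoPE does; base frequency 10000 is the common convention.
def rope(vec, pos, base=10000.0):
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.2, 0.8, 0.5], [1.1, 0.4, -0.7, 0.2]
# Score for (query pos 7, key pos 4) equals score for (3, 0): only the
# relative offset matters, so rotation can happen inside the kernel.
assert math.isclose(dot(rope(q, 7), rope(k, 4)), dot(rope(q, 3), rope(k, 0)))
```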

Collaborator


Same comment on RoPE above

@HaiShaw
Collaborator

HaiShaw commented Mar 16, 2025

@merrymercy Could you also have a look?

@kbumsik

kbumsik commented Mar 18, 2025

  • Need to explore options to have it on AMD/ROCm, or give log/assert info from the args options on ROCm.

Hi. I am an infra engineer supporting this PR. Currently we do not have AMD hardware to test on in the cloud, but I will look into whether we can get some cloud AMD hardware within budget, and whether @daniel-geon-park and @gmlwns2000 can work on AMD hardware.

But the only cloud hardware I can find is MI300X on Runpod. Would testing on MI300X be sufficient? I would appreciate it if you know of any other AMD GPU cloud provider.

@HaiShaw
Collaborator

HaiShaw commented Mar 18, 2025

  • Need to explore options to have it on AMD/ROCm, or give log/assert info from the args options on ROCm.

Hi. I am an infra engineer supporting this PR. Currently we do not have AMD hardware to test on in the cloud, but I will look into whether we can get some cloud AMD hardware within budget, and whether @daniel-geon-park and @gmlwns2000 can work on AMD hardware.

But the only cloud hardware I can find is MI300X on Runpod. Would testing on MI300X be sufficient? I would appreciate it if you know of any other AMD GPU cloud provider.

@kbumsik Thanks for the question. MI300X should be sufficient; I will try to help arrange access if you really can't find any. Ping me on Slack if needed.

@daniel-geon-park
Author

daniel-geon-park commented Mar 18, 2025

  • Too many interface- and class-level changes; look into refactoring to minimize them and avoid performance impact.

We tried to minimize the interface-level changes, but I believe this is the minimal amount of change needed to make our method work in SGLang. If you have suggestions, I'd be happy to incorporate them.

  • Just the 3 models involved?
  • What if it is run with DeepSeekV3/R1, etc.: error out, or not supported? These cases need to be handled.

That's a fair point. We added logic to error out on unsupported models. We'll incorporate more popular models such as DeepSeekV3/R1.
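As a sketch of the kind of guard meant here (the architecture names and the function are hypothetical, not the PR's actual identifiers):

```python
# Hypothetical guard: error out early when HiP attention is enabled for a
# model architecture it does not support yet. Names are illustrative only.
HIP_SUPPORTED_ARCHS = {"LlamaForCausalLM", "Qwen2ForCausalLM", "ExaoneForCausalLM"}

def assert_hip_supported(arch: str) -> None:
    if arch not in HIP_SUPPORTED_ARCHS:
        raise ValueError(f"HiP attention does not yet support {arch}")

assert_hip_supported("LlamaForCausalLM")  # supported: no error
try:
    assert_hip_supported("DeepseekV2ForCausalLM")
    raised = False
except ValueError:
    raised = True
assert raised  # unsupported architectures are rejected with a clear error
```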

  • Could you provide example runs with commands and results?

Sure. Please refer to the following page for example commands:

https://github.com/DeepAuto-AI/hip-attention/blob/deepauto/dev/docs/USAGE.sglang.md

  • Where is the HiP package to be imported from? Should pyproject be updated?

We are concurrently working to add the HiP package to the PyPI index, but it has not been published yet. We will update as soon as it is online.

@kbumsik

kbumsik commented Mar 19, 2025

@kbumsik Thanks for the question. MI300X should be sufficient; I will try to help arrange access if you really can't find any. Ping me on Slack if needed.

Thank you for answering! Unfortunately we are unable to get a budget AMD GPU at the moment:

  • Runpod: currently out of stock.
  • Vultr: currently requires a long-term contract.
  • Azure: seems not yet available for new customers (we have been using AWS).

So could I ask you to arrange access? I am not sure of your nickname on Slack to send a DM, so I am posting here. What is your nickname on Slack? Or you can find me on the SGLang Slack (kbumsik@gmail.com), btw.

@daniel-geon-park daniel-geon-park force-pushed the deepauto/feat/sglang-pr branch from ab3f6f0 to 5444880 on May 21, 2025 20:00
@daniel-geon-park
Author

@zhaochenyang20 @HaiShaw

Bumping this PR to move it forward. Can it be merged? If not, what blocks it from being merged?

@sleepwalker2017

Nice work! How does it work for multimodal models?

@gmlwns2000
Contributor

gmlwns2000 commented May 31, 2025

Hi @sleepwalker2017 ,

We tested Qwen2.5 VL and Llama 4 models with HiP Attention. We replace the LLM's attention mechanism with HiP.

I am not sure whether this PR contains patches for handling vision models, but my working branch (feat/delta on DeepAuto-AI/sglang and DeepAuto-AI/hip-attention) definitely handles them.

Here are our draft performance results:

| LongVideoBench (first 748 samples) | T (k) | FA | Ours |
|---|---|---|---|
| Llama4 Scout 109B | 256 | 52.27 | 51.07 |
| Qwen2.5 VL 32B | 128 | 56.15 | 54.28 |

I am not sure the above results are sufficiently optimized, because I simply copied almost the same hyperparameters from language models with HiP, so performance will probably increase once we tune the hyperparameters.

To reproduce, these are the commands to run SGLang with vision models:

# Llama4 + HiP

SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=0 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 20000 --tp 8 --max-total-tokens 200000 --context-length 200000 \
  --cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
  --hip-attention-config '{"using_extend": false, "dense_layers": [0, 1, 2, 3]}' \
  --disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
  --enable-multimodal --chat-template llama-4 --enable-hip-attention

# QwenVL 2.5 + HiP

VIDEO_MAX_PIXELS=$((128000 * 28 * 25)) \
SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=1 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-72B-Instruct-AWQ \
  --port 30000 --tp 8 --max-total-tokens 256000 --context-length 256000 \
  --cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
  --hip-attention-config '{"using_extend": true, "dense_layers": [0, 1, 2, 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 79], "layers": [{"sliding_window_size": "122880", "second_stage_k": 4096, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}, {"sliding_window_size": 4096, "second_stage_k": 2048, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}]}' \
  --disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
  --enable-multimodal --chat-template qwen2-vl \
  --kv-cache-dtype fp8_e5m2 --enable-hip-attention
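Since --hip-attention-config takes a JSON string, one quick way to sanity-check a config before launching is to round-trip it through plain json; shown here with the smaller Llama 4 config above:

```python
import json

# Round-trip the smaller --hip-attention-config value shown above to catch
# JSON syntax errors before handing the string to the server launcher.
raw = '{"using_extend": false, "dense_layers": [0, 1, 2, 3]}'
cfg = json.loads(raw)
assert cfg["using_extend"] is False
assert cfg["dense_layers"] == [0, 1, 2, 3]
```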

@sleepwalker2017


What do the numbers in the sheet mean? Time cost, throughput, or something else? Thank you!

@gmlwns2000
Contributor

@sleepwalker2017

The table shows model performance (accuracy) on LongVideoBench. We have not measured latency on that benchmark yet, but our paper may be helpful for understanding the latency of our approach.

Thanks!
