[Feature] Support Efficient Sparse HiP Attention (InfiniteHiP) with Long-Context Generalization and KV Offloading Capabilities #3930
Conversation
@daniel-geon-park Sorry for the long wait. Could you rebase this? We've reminded the reviewer and will review and merge it soon.
(force-pushed 0d0de6c to 1f0c77f)
Thanks! I have squashed and rebased the commits.
Thanks! I will ask hai for help.
@daniel-geon-park hello, could you rebase with the main branch? I asked hai for help again.
(force-pushed a31aa81 to fdee089)
I rebased with the main branch again.
(force-pushed 0fb9859 to ee467de)
@daniel-geon-park Sorry for the delay, I am prioritizing this on top of everything else.
@daniel-geon-park For the hip-attention library and kernels, do they work with both Nvidia and AMD GPUs?
It only works for Nvidia GPUs at the moment.
@daniel-geon-park Is this feature compatible with chunked prefill?
Yes, indeed it is!
Thanks for the contribution!
General comments:
- Too many interface- and class-level changes; consider some code refactoring to minimize them and avoid the performance impact.
- Need to explore options for supporting AMD/ROCm, or add log/assert messages for these args options on ROCm.
- Are only the 3 models involved? What happens when run with DeepSeekV3/R1, etc.? Does it error out, or is it unsupported? These cases need to be handled.
- Could you provide example runs with commands and results?
- Where is the HiP package imported from? Should pyproject be updated?
Do we support rope scaling here?
The update_context_length function changes the context length in the model config so that the cached positional embeddings are long enough to handle the extended inputs.
On the other hand, rope scaling keeps the original context length in a separate variable called config.rope_scaling.original_max_position_embeddings, which is not modified here. So this should not interfere with the rope scaling.
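To make the interaction concrete, here is a minimal sketch of the behavior described above. The class and function names are illustrative stand-ins, not SGLang's actual implementation; only the two config fields (`max_position_embeddings` and `rope_scaling.original_max_position_embeddings`) mirror the real Hugging Face-style config layout:

```python
# Hypothetical sketch of update_context_length semantics (not SGLang's code):
# the effective context length grows, the rope-scaling anchor stays intact.

class RopeScaling:
    def __init__(self, original_max_position_embeddings):
        # Rope scaling records the *original* training context length here.
        self.original_max_position_embeddings = original_max_position_embeddings

class ModelConfig:
    def __init__(self, max_position_embeddings, rope_scaling):
        self.max_position_embeddings = max_position_embeddings
        self.rope_scaling = rope_scaling

def update_context_length(config, new_len):
    # Enlarge only the effective context length so the cached positional
    # embeddings cover extended inputs; rope scaling metadata is untouched.
    config.max_position_embeddings = max(config.max_position_embeddings, new_len)
    return config

config = ModelConfig(131072, RopeScaling(8192))
update_context_length(config, 3_000_000)
assert config.max_position_embeddings == 3_000_000
assert config.rope_scaling.original_max_position_embeddings == 8192
```

Since only `max_position_embeddings` is enlarged, any rope-scaling computation that reads `original_max_position_embeddings` sees the same value before and after the extension.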
Is it necessary to add RoPE within RadixAttention? Is it generic enough to put here?
The context extension mechanism of HiP Attention requires that RoPE is not pre-applied to the queries and keys; instead, it is applied inside the attention kernel. Therefore, we need a way to pass the cached RoPE to the attention backend. If you have a better idea for this, please let me know.
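The design constraint described above can be sketched with a toy example. All names here are illustrative, not SGLang's or HiP's real API: the point is only that handing the rotary function to the backend (so it rotates q/k internally) produces the same scores as pre-rotating, while leaving the backend free to remap positions for context extension:

```python
import math

def rope_rotate(x, pos, dim=4, base=10000.0):
    # Standard RoPE rotation of a single vector x at position pos.
    out = list(x)
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def attention_backend(q, k, q_pos, k_pos, rope=None):
    # When a rope callable is passed in, apply it *inside* the backend, so the
    # kernel receives un-rotated q/k and could remap positions if it wanted.
    if rope is not None:
        q, k = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(q, k))  # one q.k score, enough for a sketch

q, k = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]
pre = attention_backend(rope_rotate(q, 5), rope_rotate(k, 3), 5, 3)  # usual path
inside = attention_backend(q, k, 5, 3, rope=rope_rotate)             # HiP-style path
assert abs(pre - inside) < 1e-9
```

With identical positions the two paths agree exactly; the kernel-applied variant is what lets HiP substitute remapped positions during extension.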
Same comment on RoPE above
@merrymercy Could you also have a look?
Hi. I am an infra engineer supporting this PR. Currently we do not have AMD hardware to test on in the cloud, but I will look into whether we can get some cloud AMD hardware within the budget, and whether @daniel-geon-park and @gmlwns2000 can work on AMD hardware. However, the only cloud hardware I can find is MI300X on RunPod. Would testing on MI300X be sufficient? I would appreciate it if you know of any other AMD GPU cloud provider.
@kbumsik Thanks for the question. MI300X should be sufficient; I will try to help arrange access if you really can't find it. Ping me in Slack if needed.
We tried to minimize the interface-level changes, but I believe this is the minimal amount of change needed to make our method work in SGLang. If you have suggestions, I'd be happy to incorporate them.
That's a fair point. We added logic to error out on unsupported models. We'll incorporate more popular models such as DeepSeekV3/R1.
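A guard like the one mentioned above could look roughly like this. The allowlist contents and the function name are hypothetical, for illustration only; the PR's actual check may differ:

```python
# Illustrative sketch of an "error out on unsupported models" guard
# (the architecture names and helper below are hypothetical, not SGLang's code).

HIP_SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "MistralForCausalLM",
}

def check_hip_support(architecture: str) -> None:
    # Fail fast at server startup rather than with a confusing runtime error.
    if architecture not in HIP_SUPPORTED_ARCHITECTURES:
        raise ValueError(
            f"--enable-hip-attention does not support {architecture!r} yet; "
            f"supported architectures: {sorted(HIP_SUPPORTED_ARCHITECTURES)}"
        )

check_hip_support("LlamaForCausalLM")  # supported: passes silently
try:
    check_hip_support("DeepseekV3ForCausalLM")
except ValueError as e:
    print("rejected:", e)
```

Failing at argument-parsing time gives the user an actionable message instead of a kernel-level crash mid-request.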
Sure. Please refer to the following page for example commands: https://github.com/DeepAuto-AI/hip-attention/blob/deepauto/dev/docs/USAGE.sglang.md
We are concurrently working on adding the HiP package to the PyPI index, but it has not been published yet. We will update as soon as it is online.
Thank you for answering! Unfortunately, we are unable to get budget for an AMD GPU at the moment.
So could I ask you to arrange access? I am not sure of your nickname on Slack to send a DM, so I am posting here. What is your nickname on Slack? Alternatively, you can find me on the SGLang Slack (kbumsik@gmail.com).
(force-pushed ab3f6f0 to 5444880)
Bumping this PR to move it forward. Can it be merged? If not, what blocks it from being merged?
Nice work! How does it work for multimodal models?
Hi @sleepwalker2017, we tested Qwen2.5 VL and Llama 4 models with HiP Attention. We replace the attention mechanism of the LLM with HiP. I am not sure this PR contains patches for handling vision models, but my working branch (…) does. Here are our draft performance results:
I am not sure the above result is sufficiently optimized, because I simply copied almost the same hyperparameters from the language models with HiP, so the performance will probably improve once we optimize the hyperparameters. To reproduce, this is the command to run SGLang with vision models:
# Llama4 + HiP
SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=0 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 20000 --tp 8 --max-total-tokens 200000 --context-length 200000 \
--cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
--hip-attention-config '{"using_extend": false, "dense_layers": [0, 1, 2, 3]}' \
--disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
--enable-multimodal --chat-template llama-4 --enable-hip-attention
# QwenVL 2.5 + HiP
VIDEO_MAX_PIXELS=$((128000 * 28 * 25)) \
SGLANG_IMAGE_PROCESSOR_FAST_DEVICE=cpu \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
HIP_DISABLE_FLASHDECODE=1 \
HIP_HEAD_REDUCE=1 \
HIP_DEBUG_LAST_DENSE=128 \
HIP_DEBUG_FORCE_DENSE_DECODE=1 \
CUDA_LAUNCH_BLOCKING=0 \
HIP_DEBUG=0 \
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-VL-72B-Instruct-AWQ \
--port 30000 --tp 8 --max-total-tokens 256000 --context-length 256000 \
--cuda-graph-bs 1 --max-running-req 1 --chunked-prefill-size -1 \
--hip-attention-config '{"using_extend": true, "dense_layers": [0, 1, 2, 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 79], "layers": [{"sliding_window_size": "122880", "second_stage_k": 4096, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}, {"sliding_window_size": 4096, "second_stage_k": 2048, "sa_extend_backend": "streaming", "scan_extend_backend": "streaming", "stages": [{"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 128, "stage_k": null, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 4, "stage_chunk_size": 32, "stage_k": 32768, "stage_stride": 1}, {"stage_block_size_q": 64, "stage_block_stride_q": 1, "stage_chunk_size": 8, "stage_k": 8192, "stage_stride": 1}]}]}' \
--disable-radix-cache --attention-backend flashinfer --disable-cuda-graph \
--enable-multimodal --chat-template qwen2-vl \
--kv-cache-dtype fp8_e5m2 --enable-hip-attention
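Since `--hip-attention-config` takes a raw JSON string, a quick sanity check of the structure before launching can save a failed server start. The snippet below validates a trimmed subset of the Qwen2.5-VL config above; the validator itself is only an illustration, and field semantics are documented in the HiP usage guide linked earlier in the thread:

```python
import json

# Trimmed subset of the --hip-attention-config JSON used in the QwenVL command.
raw = (
    '{"using_extend": true, '
    '"dense_layers": [0, 1, 2, 3, 11, 19, 27, 35, 43, 51, 59, 67, 75, 79], '
    '"layers": [{"sliding_window_size": "122880", "second_stage_k": 4096, '
    '"sa_extend_backend": "streaming", "scan_extend_backend": "streaming"}]}'
)
config = json.loads(raw)

# Context extension is enabled, and the first four layers stay dense.
assert config["using_extend"] is True
assert config["dense_layers"][:4] == [0, 1, 2, 3]
# Note: sliding_window_size is passed as a string in the original command,
# so coerce before comparing numerically.
assert int(config["layers"][0]["sliding_window_size"]) == 122880
```

A malformed string here would raise `json.JSONDecodeError` immediately, which is easier to debug than a shell-quoting error surfacing inside the server.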
What do the numbers in the sheet mean? Time cost, throughput, or something else? Thank you!
The table shows the model performance on LongVideoBench (accuracy). We did not measure the latency on that benchmark yet, but our paper might be helpful to understand the latency of our approach. Thanks! |
Motivation
Dear SGLang maintainers,
We implement InfiniteHiP (Extending Language Model Context Up to 3 Million Tokens on a Single GPU), the latest version of HiP Attention, building on our ICLR 2025 paper, "A Training-Free Sub-quadratic Cost Transformer Model Serving Framework with Hierarchically Pruned Attention".
Just by enabling our mechanism with the flag --enable-hip-attention, the model is able to generalize beyond the context length it was trained on and perform efficient sparse attention, without any noticeable degradation of the pretrained model's capabilities.
Furthermore, the switch --enable-hip-kv-cache-offload enables our (experimental) KV offloading mechanism, which alleviates the memory pressure of extremely long contexts and enables serving up to 3 million tokens on a single L40S GPU.
The patches added to the SGLang codebase are minimal, as most of the implementation lives in our library.
Modifications
Checklist