Releases: HabanaAI/vllm-fork
v0.9.0.1+Gaudi-1.23.0
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.23.0.
What's Changed
- Enable interleaved sliding_window for gemma3 by @jiminha in #1344
- docker vllm: update readme by @tthaddey in #1525
- Update hpu-ext sha and temporarily disable deepseek test by @kwisniewski98 in #1534
- [SW-234006] Fix requirements by @szutenberg in #1531
- Embedding fix: warmup failure in embedding model by @shepark in #1510
- add split qkv to gemma3 by @skaulintel in #1517
- Enable vision bucketing/warmup for gemma3 model by @libinta in #1470
- [SW-234248] Set pytorch version for version_branch by @tlipinski1337 in #1553
- change vllm-hpu-extension sha to 71a66fb by @iboiko-habana in #1544
- docker vllm: cleanup configs and add missing models by @tthaddey in #1549
- Add accelerate to requirements/hpu.txt by @tpawlows in #1564
- Fix AttributeError: 'NoneType' object has no attribute 'getenv' by @michalkuligowski in #1554
- Set hpu-extension to 222acde by @ksmusz in #1545
- Fix V1 uniproc executor segfaults by @kzawora-intel in #1570
- Update Force Channel FP8 Check by @yiliu30 in #1561
- docker vllm: add server config for model Qwen/Qwen2.5-VL-7B-Instruct by @tthaddey in #1577
- Fix calling shutdown inc in del by @michalkuligowski in #1574
- [SW-234248] Take pytorch version directly from bridge repo by @tlipinski1337 in #1572
- fix prompt_logprob crash when delayed sampling is on by @ccrhx4 in #1421
- [V0] Use device as the set_device's parameter by default, update proxy of pd by @zhenwei-intel in #1540
- Fix Ray vLLM example 'latest' link in README_GAUDI.md by @michalkuligowski in #1390
- [SW-234741] Use internal token for fetching pt_version by @tlipinski1337 in #1583
- Readme warmup update by @adobrzyn in #1512
- Change vllm-hpu-extension revision to f831cb1 by @iboiko-habana in #1587
- Updated README_GAUDI.md with gaudinet.json prereq by @anastasiauvarovaintel in #1588
- Num blocks fix - V1 by @adobrzyn in #1594
- docker vllm: Split entrypoints into separate classes and update vllm installation in docker by @tthaddey in #1602
- V1 - don't look for buckets we know don't exist by @adobrzyn in #1606
- Added support for FusedSDPA kernel with window_size for Gemma3 by @jiminha in #1589
- remove logic that uses more memory in prepare_attn_masks by @libinta in #1597
- Fix AttributeError during shutdown of RayDistributedExecutor by @tpawlows in #1599
- Fix warmup skip and cleanup for gemma3-vl by @libinta in #1623
- Update extension - Fix fallback buckets by @adobrzyn in #1624
- [SW-235047] use w8a8 path for per_channel for performance regression fixing by @xuechendi in #1629
- Port high-level profiler to V1 engine by @jkaniecki in #1501
- [V1][MLA][SW-234434] Enable MLA for V1 - ported from vllm-gaudi by @xuechendi in #1628
- gemma3: fix accuracy issue caused by not skipping image on top right by @libinta in #1635
- Fix: Round up to sliding window threshold - update extension by @adobrzyn in #1637
- Enable LMCache for CPU offloading and add LMCache Docker support by @shepark in #1645
- [Security] Fix: Structurally dead code (#1625) by @afierka-intel in #1639
- [Security] Fix: Bad use of null-like value (#1634) by @afierka-intel in #1640
- Update hpu.txt by @afierka-intel in #1654
- Remove dtype.float16 support for hpu config by @iboiko-habana in #1657
- [SW-234344] Fix 'RotaryEmbedding' object has no attribute 'sin' by @xuechendi in #1658
- ValueError: 'aimv2' is already used by a Transformers config by @michalkuligowski in #1680
- [V1] Defragmentation support by @madamczyk-intel in #1568
- Set hpu-extension to 6b2f6fb by @ksmusz in #1684
- Remove inference_mode() from platforms.hpu by @jkaniecki in #1691
- Remove V1 HPU support from the fork by @kzawora-intel in #1707
- skip softmax/log_softmax when greedy_sampling with no logprobs by @xuechendi in #1711
- [SW-234516] Fix padding for padding aware path by @PatrykWilczewski in #1702
- [SW-234805] Fix target_device for weights load by @kfojcik-intel in #1733
- use value for mrope check by @xuechendi in #1740
- move detoken to serving_client by @xuechendi in #1741
- add env and remove mark_step by @xuechendi in #1739
- Fix Data Parallel by @xinyu-intel in #1742
- [SW-236277] Fix RotaryEmbedding cos-sin prepare by @kfojcik-intel in #1765
- Update vllm-hpu-extension commit by @xuechendi in #1759
- Update TESTOWNERS by @mgawarkiewicz-intel in #1757
- adding wpyszka to codeowners by @wpyszka in #1480
- Fix text-only prompt in Llama Vision (#1621) by @kdamaszk in #1622
- Update Pipeline Parallelism description in README_GAUDI. by @jmaksymc in #1567
- Gemma3 v1.22 changes (Sliding_Window feature + few others) by @hsubramony in #1720
- [CI] List passed and failed models at the end of lm_eval test suite by @kzawora-intel in #1571
- Increase regional compilation multiplier by @kwisniewski98 in #1771
- Port: V0 aware padding scheduler batch_size fix by @iboiko-habana in #1805
- Port: Fix merged prefill with new bucketing manager (#1746) by @adobrzyn in #1806
- Update CODEOWNERS by @mgawarkiewicz-intel in #1813
- [SW-232910] Poor TTFT troubleshooting tip by @michalkuligowski in #1801
- [SW-235019] Fix for Invalid credentials in Authorization header, Qwen1.5-0.5B-Chat by @pawel-olejniczak in #1826
- Add detok in chat completion fn for non stream mode when VLLM_DETOKENIZE_ON_OPENAI_SERVER=true by @shepark in #1768
- [SW-235104][vLLM] pipeline_entrypoints - matmul(): argument 'input' (position 1) must be Tensor, not NoneType by @hsubramony in #1687
- [SW-238029] Fix max_batch_size handling - Llama perf degradation fix by @jiminha in #1839
- Port: v0 aware padding scheduler fix for bs=1 by @iboiko-habana in #1843
- Cherrypick from 1.22_next to main by @PatrykWo in #1860
- Fix sliding-window, bs=0 issue by @afierka-intel in #1908
- Sync the vLLM Docker image readme with the 1.22 release by @PatrykWo in #1920
- [SW-240222] pin ray to <2.49.0 by @ldurejko in #1919
- [SW-235186] Update vllm-hpu-extension with support group indexing by @jmamzax in #1867
- Update common.txt (#1956) by @afierka-intel in #1963
- Fix not cleared globals in runtime config by @afierka-intel in #1983
- Fix APC decode long context by @kamil-kaczor in #2022
- Fix long context APC warmup by @kamil-kaczor in #2004
- Cherry-pick EOL docs change (#2030) by @PatrykWo in https://g...
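Many of the entries above (warmup fixes, fallback buckets, padding-aware scheduling) revolve around one idea: batch and sequence dimensions are rounded up to a small set of pre-warmed bucket sizes so that HPU graphs only ever see known shapes. As an illustration only — the names and linear step scheme below are hypothetical, not the fork's actual bucketing API — the round-up can be sketched as:

```python
import math

def find_bucket(value: int, bmin: int, bstep: int, bmax: int) -> int:
    """Round `value` up to the nearest bucket boundary.

    Hypothetical sketch: buckets start at `bmin`, grow in steps of
    `bstep`, and are capped at `bmax`. Not vLLM-fork's real API.
    """
    if value <= bmin:
        return bmin
    # Round up to the next bucket boundary above bmin.
    bucket = bmin + math.ceil((value - bmin) / bstep) * bstep
    return min(bucket, bmax)

# A 147-token prompt with buckets (min=128, step=128, max=2048)
# pads up to the 256-token bucket.
```

Only bucketed shapes need warmup, which is why several fixes above concern requests that fall outside or between the configured bucket boundaries.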
v0.9.0.1.post1+Gaudi-1.22.2
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.22.2.
What's Changed
- Update fork branch in docker to 1.22.2 release by @PatrykWo in #2191
- Add h11-max args to cli_args by @agrabow in #2192
- Add missing description of enable_mm_embeds parameter by @afierka-intel in #2200
We are providing the following fixes to mitigate identified security vulnerabilities in this release.
CVE-2025-48956 Fix Limit HTTP header count and size by @agrabow in #2173
CVE-2025-59425 Fix flaw in token authentication logic by @agrabow in #2177
CVE-2025-6242 Fix for Server-Side Request Forgery vulnerability by @agrabow in #2180
CVE-2025-62372 [Frontend] Require flag for loading text and image embeds by @agrabow in #2185
Full Changelog: v0.9.0.1+Gaudi-1.22.2...v0.9.0.1.post1+Gaudi-1.22.2
v0.9.0.1+Gaudi-1.22.2
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.22.2.
What's Changed
- Update common.txt by @afierka-intel in #2150
- RHEL build fix - Yum update by @PatrykWo in #2151
- Fix xgrammar fallback for v0 by @12010486 in #2155
- Fix Gaudi UBI image build (#2014) by @ghandoura in #2156
Known issues addressed by the following fixes:
- CVE-2025-48956 Fix Limit HTTP header count and size by @agrabow in #2173
- CVE-2025-59425 Fix flaw in token authentication logic by @agrabow in #2177
- CVE-2025-6242 Fix for Server-Side Request Forgery vulnerability by @agrabow in #2180
- CVE-2025-62372 [Frontend] Require flag for loading text and image embeds by @agrabow in #2185
Full Changelog: v0.9.0.1+Gaudi-1.22.0...v0.9.0.1+Gaudi-1.22.2
v0.9.0.1+Gaudi-1.22.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.22.0 and above
Running vLLM on Gaudi with Docker Compose
Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and Gaudi software. Please follow the instructions to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you see the error docker: Error response from daemon: Unknown runtime specified habana., refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
Note
Prerequisite: Starting from the 1.22.x Intel Gaudi software version, the RHEL Docker image must be created manually before running the command. Additionally, the path to the Docker image must be updated in the Dockerfile.hpu.ubi file.
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
Build and Install vLLM
Multiple options are provided for installing vLLM with Intel® Gaudi®; pick one:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically on vLLM startup | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism, using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile; FP8 and BF16 precisions are fully supported. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step schedulin... |
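The paged KV cache row above can be made concrete with a little arithmetic: each sequence's key/value cache lives in fixed-size blocks, so memory is claimed in block-size steps rather than per token. A minimal sketch (hypothetical helper, not the backend's code; 128 is one of the HPU-supported block-size options):

```python
def kv_blocks_needed(seq_len: int, block_size: int = 128) -> int:
    """Number of fixed-size KV-cache blocks a sequence occupies.

    Illustrative only. Uses ceiling division: a partially filled
    final block still occupies a whole block.
    """
    return -(-seq_len // block_size)

# A 300-token sequence with 128-token blocks occupies 3 blocks;
# the last block has 84 free slots until more tokens are generated.
```

This is why growing a sequence by one token usually costs nothing, but crossing a block boundary allocates a whole new block.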
v0.8.5+Gaudi-1.22.0-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistral & Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
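One entry above (#138) adds a functools.wraps decorator to with_mark_steps. A sketch of why that matters, with the device synchronization call stubbed out (hypothetical body, not the fork's implementation):

```python
import functools

def with_mark_steps(fn):
    """Wrap fn so a graph-break/sync point runs around it.

    Hypothetical sketch: the real decorator would call a mark-step
    primitive before and after fn; stubbed here with comments.
    """
    @functools.wraps(fn)  # preserve fn.__name__/__doc__ for profiler event names
    def wrapped(*args, **kwargs):
        # mark_step() would run here, before the wrapped call
        result = fn(*args, **kwargs)
        # mark_step() again after the wrapped call
        return result
    return wrapped

@with_mark_steps
def forward(x):
    return x * 2
```

Without functools.wraps, every decorated function would report its name as "wrapped", which muddles the high-level profiler events also mentioned in this changelog.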
v0.8.5.post1+Gaudi-1.21.3
What's Changed
- Update requirements-hpu.txt by @michalkuligowski in #1018
- [SW-224648] Redirect test logs to file by @bmyrcha in #1016
- add ScaleToHwAligned for loading fp8 vllm model by @changwangss in #941
- Fix async callback ordering by @madamczyk-intel in #1023
- Implement Pipeline Parallelism support for HPU. by @jmaksymczuk in #1000
- Make lazy mode autodetection more robust by @kzawora-intel in #921
- [SW-224648] Fix test logs redirection by @bmyrcha in #1026
- [CI] Add APC tests by @kzawora-intel in #866
- [SW-225233] Adjust method of getting synapse_build by @bmyrcha in #1044
- Add more testowners by @adobrzyn in #1046
- APC - Remove prompt attn with context and use existing implementation by @adobrzyn in #1020
- Add exponential bucketing integration by @kzawora-intel in #642
- Marketing requested additional details of the ramp-up phase. by @MohitIntel in #1069
- Add in Dockerfile.hpu.ubi by @AnetaKaczynska in #1077
- Synchronize vLLM flags to support cross-node inference by @IT-Forrest in #897
- Set VLLM_T_COMPILE_FULLGRAPH=False in CI multi-modal tests by @afierka-intel in #1042
- Enable APC pre-merge tests to compile test suite by @afierka-intel in #1076
- IG: fix multimodal reshape for Qwen2.5-VL (revert #691) by @imangohari1 in #1081
- Fix embedding model accuracy issue when merged prefill is enabled by @libinta in #1047
- Enable dynamic shape for torch.compile under flag by @anko-intel in #1078
- [SW-225980] Allow to skip pytest for non-code related changes by @bmyrcha in #1092
- Update CODEOWNERS by @mgawarkiewicz-intel in #1107
- fix prepare_cos_sin invoke in RotaryEmbedding by @zhouyu5 in #1035
- multi-image support for llama3.2 [1/N] by @zhouyu5 in #926
- Add t.compile fp8 performance test to jenkins by @bkowalskiINTEL in #1066
- Update run-tests.sh by @michalkuligowski in #1117
- Rebase - 2025.04.06 by @kzawora-intel in #947
- Revert "Rebase - 2025.04.06" by @kzawora-intel in #1128
- Rebase mar 24 again by @michalkuligowski in #1127
- Restore fsdpa calibration by @madamczyk-intel in #1086
- Rebase mar 24 fixed by @michalkuligowski in #1130
- Simplify calling torch.compile by @anko-intel in #1140
- Bump xgrammar from 0.1.11 to 0.1.18 by @dependabot[bot] in #1043
- Update requirements-hpu.txt by @afierka-intel in #1125
- Modify RobertaEmbedding forward as custom op method by @yeonsily in #996
- [TC] Fix to graph break inside set_block_mapping by @jczaja in #1143
- [SW-224668] Fix for LLaMA LoRA test_layers_hpu by @rsshaik1 in #1074
- [SW-224666] Fix for LLaMA LoRA test_lora_manager_hpu by @rsshaik1 in #1070
- Fix profiling collection for VLLM_PT_PROFILE by @mswiniarsk in #1156
- Enable torchrun on Gaudi by @czhu15 in #974
- Minor fix regd. VLLM_GRAPH_PROMPT_RATIO in README_GAUDI.md by @MohitIntel in #1168
- Fix accuracy issue for llama 3.2 vision models. by @libinta in #1176
- add test owner by @jikunshang in #1082
- Add additional devs to TESTOWNERS by @bkowalskiINTEL in #1075
- Update CODEOWNERS by @michalkuligowski in #1185
- [SPEC_DECODE][V0] fix for spec decode eagle after rebase by @xuechendi in #1150
- Fix fixture duplication in async_engine tests by @akarnows in #1180
- Rebase apr 25 by @michalkuligowski in #1166
- [SW-225282] - Handle Batch Dimension for LoRA by @hlahkar in #1182
- Rebase apr 30 by @michalkuligowski in #1190
- Reduce recompilations when using merged_prefill by @madamczyk-intel in #1167
- Update TESTOWNERS by @madamczyk-intel in #1200
- [SW-225635] Adjust logging in CI by @bmyrcha in #1202
- Switch V1 env to False as default by @afierka-intel in #1206
- Update codeowners by @madamczyk-intel in #1217
- Rebase may 06 by @michalkuligowski in #1207
- [V1] Set dynamo cache size even if warmup is skipped by @Kacper-Pietkun in #1173
- Introduce block_softmax_adjustment kernel by @madamczyk-intel in #1174
- add missing transpose in MultiHeadAttention by @zhouyu5 in #1218
- [Spec Decode] Fix MLP speculative failing issue after rebase to Apr 30 by @xuechendi in #1210
- [Deepseek R1][v0] Porting deepseek r1 to habana_main by @xuechendi in #1161
- Set vllm-hpu-extension to 89030c by @madamczyk-intel in #1228
- Set hpu-extension to a060794 by @madamczyk-intel in #1232
- Add VLLM_PROFILE_* flags to V1 by @madamczyk-intel in #1203
- Update Dockerfile.hpu.ubi by @AnetaKaczynska in #1205
- Fix INC Finalization Check by @yiliu30 in #1230
- [CI] Align t.compile and lazy test definitions by @anko-intel in #1157
- [SW-228109][v0] [llama4 ]Llama 4 support for vLLM fork by @leopck in #1235
- fix dummy sequence length setting in llama3.2 by @zhouyu5 in #1229
- Enable Delayed Sampling by default by @mswiniarsk in #937
- [V1] Port t.compile optimizations from V0 to V1 by @Kacper-Pietkun in #1237
- [V1] enable fp8 by @Kacper-Pietkun in #1222
- Switch to V0 by default in envs.py by @kwisniewski98 in #1233
- [SW-228755] Fix CI for v0 spec decode fix by @xuechendi in #1252
- Apply test permission by @zhouyu5 in #1258
- [CI] Align t.compile and lazy tests by @anko-intel in #1250
- [BugFix] Fix --disable-log-stats in V1 server mode vllm-project#17600 by @iboiko-habana in #1249
- [SW-219737][habana_main] Support MTP to deepseek by @xuechendi in #1254
- fix text only input for llama3.2 by @zhouyu5 in #1262
- Remove intel implementation of top-p/top-k sampling method by @afierka-intel in #1243
- [CI] Add benchmark return status by @anko-intel in #1259
- [habana_main]enable padding_aware_scheduler for speculative decoding by @xuechendi in #1264
- Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #1184
- [SW-228365] - Update test cases for Lora by @hlahkar in #1256
- fix embedding crash with torch.compile by @libinta in #1213
- WA for CI - pkg resources by @adobrzyn in #1280
- [SW-228266] Fix LoRA layers test by @hlahkar in #1276
- Skip guards after fully warming up the model by @anko-intel in #1272
- Replace in-place add with out-of-place add in layernorm forward_hpu. by @jmaksymc in #1281
- Add 256 as possible option within block-size arg by @ksmusz in #1279
- Flat KV cache layout by @kdamaszk in #1106
- [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432) by @adobrzyn in https://github.com/HabanaAI/vllm-fork/pull/...
v0.8.5+Gaudi-1.21.2-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistral & Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.2
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.2 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
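For reference, a registered habana runtime typically corresponds to an entry like the following in /etc/docker/daemon.json. This is a sketch based on the standard habanalabs-container-runtime setup; the binary path may differ depending on your installation:

```json
{
  "runtimes": {
    "habana": {
      "path": "/usr/bin/habana-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After editing daemon.json, restart the Docker daemon so the runtime is picked up.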
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
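The image reference above is just the Support Matrix versions slotted into a fixed path. A minimal sketch of parameterizing it, so the pull and run commands stay in sync when you update versions (the version values below are examples for this release, not authoritative):

```shell
# Compose the Gaudi Docker image reference from Support Matrix versions.
# Substitute the versions listed in the Support Matrix for your release.
GAUDI_VER=1.21.2
UBUNTU_VER=ubuntu22.04
PT_VER=2.6.0
IMAGE="vault.habana.ai/gaudi-docker/${GAUDI_VER}/${UBUNTU_VER}/habanalabs/pytorch-installer-${PT_VER}:latest"
echo "$IMAGE"
```

You can then use `$IMAGE` in both the `docker pull` and `docker run` commands.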
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.8.5.post1+Gaudi-1.21.2
$ pip install -r requirements-hpu.txt
$ python setup.py develop
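Note that the release tag encodes both the vLLM base version and the Gaudi software version. A minimal Python sketch of splitting the two (the tag strings are taken from this page; the helper function is hypothetical, not part of vLLM):

```python
def parse_release_tag(tag: str) -> tuple[str, str]:
    """Split a vllm-fork release tag of the form v<vllm_version>+Gaudi-<gaudi_version>."""
    vllm_part, gaudi_part = tag.split("+Gaudi-")
    return vllm_part.lstrip("v"), gaudi_part

print(parse_release_tag("v0.8.5.post1+Gaudi-1.21.2"))  # ('0.8.5.post1', '1.21.2')
```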
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional releas... |
v0.7.2+Gaudi-1.21.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.0 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional release) ... |
v0.6.6.post1+Gaudi-1.20.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.20.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.20.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.6.post1+Gaudi-1.20.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node with tensor parallelism using Ray and HCCL. | Documentation Example HCCL reference |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
| Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
| Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | Documentation |
Unsupported Features
- Beam s...