Releases: HabanaAI/vllm-fork
v0.9.0.1+Gaudi-1.23.0
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.23.0.
What's Changed
- Enable interleaved sliding_window for gemma3 by @jiminha in #1344
- docker vllm: update readme by @tthaddey in #1525
- Update hpu-ext sha and temporarily disable deepseek test by @kwisniewski98 in #1534
- [SW-234006] Fix requirements by @szutenberg in #1531
- Embedding fix: warmup failure in embedding model by @shepark in #1510
- add split qkv to gemma3 by @skaulintel in #1517
- Enable vision bucketing/warmup for gemma3 model by @libinta in #1470
- [SW-234248] Set pytorch version for version_branch by @tlipinski1337 in #1553
- change vllm-hpu-extension sha to 71a66fb by @iboiko-habana in #1544
- docker vllm: cleanup configs and add missing models by @tthaddey in #1549
- Add accelerate to requirements/hpu.txt by @tpawlows in #1564
- Fix AttributeError: 'NoneType' object has no attribute 'getenv' by @michalkuligowski in #1554
- Set hpu-extension to 222acde by @ksmusz in #1545
- Fix V1 uniproc executor segfaults by @kzawora-intel in #1570
- Update Force Channel FP8 Check by @yiliu30 in #1561
- docker vllm: add server config for model Qwen/Qwen2.5-VL-7B-Instruct by @tthaddey in #1577
- Fix calling shutdown inc in del by @michalkuligowski in #1574
- [SW-234248] Take pytorch version directly from bridge repo by @tlipinski1337 in #1572
- fix prompt_logprob crash when delayed sampling is on by @ccrhx4 in #1421
- [V0] Use device as the set_device's parameter by default, update proxy of pd by @zhenwei-intel in #1540
- Fix Ray vLLM example 'latest' link in README_GAUDI.md by @michalkuligowski in #1390
- [SW-234741] Use internal token for fetching pt_version by @tlipinski1337 in #1583
- Readme warmup update by @adobrzyn in #1512
- Change vllm-hpu-extension revision to f831cb1 by @iboiko-habana in #1587
- Updated README_GAUDI.md with gaudinet.json prereq by @anastasiauvarovaintel in #1588
- Num blocks fix - V1 by @adobrzyn in #1594
- docker vllm: Split entrypoints into separate classes and update vllm installation in docker by @tthaddey in #1602
- V1 - don't look for buckets we know don't exist by @adobrzyn in #1606
- Added support for FusedSDPA kernel with window_size for Gemma3 by @jiminha in #1589
- remove logic that uses more memory in prepare_attn_masks by @libinta in #1597
- Fix AttributeError during shutdown of RayDistributedExecutor by @tpawlows in #1599
- Fix warmup skip and cleanup for gemma3-vl by @libinta in #1623
- Update extension - Fix fallback buckets by @adobrzyn in #1624
- [SW-235047] use w8a8 path for per_channel for performance regression fixing by @xuechendi in #1629
- Port high-level profiler to V1 engine by @jkaniecki in #1501
- [V1][MLA][SW-234434] Enable MLA for V1 - ported from vllm-gaudi by @xuechendi in #1628
- gemma3: fix accuracy issue caused by not skipping image on top right by @libinta in #1635
- Fix: Round up to sliding window threshold - update extension by @adobrzyn in #1637
- Enable LMCache for CPU offloading and add LMCache Docker support by @shepark in #1645
- [Security] Fix: Structurally dead code (#1625) by @afierka-intel in #1639
- [Security] Fix: Bad use of null-like value (#1634) by @afierka-intel in #1640
- Update hpu.txt by @afierka-intel in #1654
- Remove dtype.float16 support for hpu config by @iboiko-habana in #1657
- [SW-234344] Fix 'RotaryEmbedding' object has no attribute 'sin' by @xuechendi in #1658
- ValueError: 'aimv2' is already used by a Transformers config by @michalkuligowski in #1680
- [V1] Defragmentation support by @madamczyk-intel in #1568
- Set hpu-extension to 6b2f6fb by @ksmusz in #1684
- Remove inference_mode() from platforms.hpu by @jkaniecki in #1691
- Remove V1 HPU support from the fork by @kzawora-intel in #1707
- skip softmax/log_softmax when greedy_sampling with no logprobs by @xuechendi in #1711
- [SW-234516] Fix padding for padding aware path by @PatrykWilczewski in #1702
- [SW-234805] Fix target_device for weights load by @kfojcik-intel in #1733
- use value for mrope check by @xuechendi in #1740
- move detoken to serving_client by @xuechendi in #1741
- add env and remove mark_step by @xuechendi in #1739
- Fix Data Parallel by @xinyu-intel in #1742
- [SW-236277] Fix RotaryEmbedding cos-sin prepare by @kfojcik-intel in #1765
- Update vllm-hpu-extension commit by @xuechendi in #1759
- Update TESTOWNERS by @mgawarkiewicz-intel in #1757
- adding wpyszka to codeowners by @wpyszka in #1480
- Fix text-only prompt in Llama Vision (#1621) by @kdamaszk in #1622
- Update Pipeline Parallelism description in README_GAUDI. by @jmaksymc in #1567
- Gemma3 v1.22 changes (Sliding_Window feature + few others) by @hsubramony in #1720
- [CI] List passed and failed models at the end of lm_eval test suite by @kzawora-intel in #1571
- Increase regional compilation multiplier by @kwisniewski98 in #1771
- Port: V0 aware padding scheduler batch_size fix by @iboiko-habana in #1805
- Port: Fix merged prefill with new bucketing manager (#1746) by @adobrzyn in #1806
- Update CODEOWNERS by @mgawarkiewicz-intel in #1813
- [SW-232910] Poor TTFT troubleshooting tip by @michalkuligowski in #1801
- [SW-235019] Fix for Invalid credentials in Authorization header, Qwen1.5-0.5B-Chat by @pawel-olejniczak in #1826
- Add detok in chat completion fn for non stream mode when VLLM_DETOKENIZE_ON_OPENAI_SERVER=true by @shepark in #1768
- [SW-235104][vLLM] pipeline_entrypoints - matmul(): argument 'input' (position 1) must be Tensor, not NoneType by @hsubramony in #1687
- [SW-238029] Fix max_batch_size handling - Llama perf degradation fix by @jiminha in #1839
- Port: v0 aware padding scheduler fix for bs=1 by @iboiko-habana in #1843
- Cherrypick from 1.22_next to main by @PatrykWo in #1860
- Fix sliding-window, bs=0 issue by @afierka-intel in #1908
- Sync the vLLM Docker image readme with the 1.22 release by @PatrykWo in #1920
- [SW-240222] pin ray to <2.49.0 by @ldurejko in #1919
- [SW-235186] Update vllm-hpu-extension with support group indexing by @jmamzax in #1867
- Update common.txt (#1956) by @afierka-intel in #1963
- Fix not cleared globals in runtime config by @afierka-intel in #1983
- Fix APC decode long context by @kamil-kaczor in #2022
- Fix long context APC warmup by @kamil-kaczor in #2004
- Cherry-pick EOL docs change (#2030) by @PatrykWo in https://g...
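Many of the entries above (warmup fixes, fallback buckets, padding-aware scheduling) revolve around one idea: batch and sequence dimensions are rounded up to a small set of pre-warmed bucket sizes so that HPU graphs only ever see known shapes. As an illustration only — the names and linear step scheme below are hypothetical, not the fork's actual bucketing API — the round-up can be sketched as:

```python
import math

def find_bucket(value: int, bmin: int, bstep: int, bmax: int) -> int:
    """Round `value` up to the nearest bucket boundary.

    Hypothetical sketch: buckets start at `bmin`, grow in steps of
    `bstep`, and are capped at `bmax`. Not vLLM-fork's real API.
    """
    if value <= bmin:
        return bmin
    # Round up to the next bucket boundary above bmin.
    bucket = bmin + math.ceil((value - bmin) / bstep) * bstep
    return min(bucket, bmax)

# A 147-token prompt with buckets (min=128, step=128, max=2048)
# pads up to the 256-token bucket.
```

Only bucketed shapes need warmup, which is why several fixes above concern requests that fall outside or between the configured bucket boundaries.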
v0.9.0.1.post1+Gaudi-1.22.2
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.22.2.
What's Changed
- Update fork branch in docker to 1.22.2 release by @PatrykWo in #2191
- Add h11-max args to cli_args by @agrabow in #2192
- Add missing description of enable_mm_embeds parameter by @afierka-intel in #2200
We are providing the following fixes to mitigate identified security vulnerabilities in this release.
CVE-2025-48956 Fix Limit HTTP header count and size by @agrabow in #2173
CVE-2025-59425 Fix flaw in token authentication logic by @agrabow in #2177
CVE-2025-6242 Fix for Server-Side Request Forgery vulnerability by @agrabow in #2180
CVE-2025-62372 [Frontend] Require flag for loading text and image embeds by @agrabow in #2185
Full Changelog: v0.9.0.1+Gaudi-1.22.2...v0.9.0.1.post1+Gaudi-1.22.2
v0.9.0.1+Gaudi-1.22.2
vLLM with Intel® Gaudi® AI Accelerators
This version is based on vLLM 0.9.0.1 and supports Intel® Gaudi® v1.22.2.
What's Changed
- Update common.txt by @afierka-intel in #2150
- RHEL build fix - Yum update by @PatrykWo in #2151
- Fix xgrammar fallback for v0 by @12010486 in #2155
- Fix Gaudi UBI image build (#2014) by @ghandoura in #2156
Known issues addressed by the following fixes:
- CVE-2025-48956 Fix Limit HTTP header count and size by @agrabow in #2173
- CVE-2025-59425 Fix flaw in token authentication logic by @agrabow in #2177
- CVE-2025-6242 Fix for Server-Side Request Forgery vulnerability by @agrabow in #2180
- CVE-2025-62372 [Frontend] Require flag for loading text and image embeds by @agrabow in #2185
Full Changelog: v0.9.0.1+Gaudi-1.22.0...v0.9.0.1+Gaudi-1.22.2
v0.9.0.1+Gaudi-1.22.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.22.0 and above
Running vLLM on Gaudi with Docker Compose
Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and Gaudi software. Please follow the instructions to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you see the error docker: Error response from daemon: Unknown runtime specified habana., refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
Note
Prerequisite: Starting from the 1.22.x Intel Gaudi software version, the RHEL Docker image must be created manually before running the command. Additionally, the path to the Docker image must be updated in the Dockerfile.hpu.ubi file.
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
Build and Install vLLM
Multiple options are provided for installing vLLM with Intel® Gaudi®; pick one:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically on vLLM startup | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism, using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile; FP8 and BF16 precisions are fully supported. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step schedulin... |
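The paged KV cache row above can be made concrete with a little arithmetic: each sequence's key/value cache lives in fixed-size blocks, so memory is claimed in block-size steps rather than per token. A minimal sketch (hypothetical helper, not the backend's code; 128 is one of the HPU-supported block-size options):

```python
def kv_blocks_needed(seq_len: int, block_size: int = 128) -> int:
    """Number of fixed-size KV-cache blocks a sequence occupies.

    Illustrative only. Uses ceiling division: a partially filled
    final block still occupies a whole block.
    """
    return -(-seq_len // block_size)

# A 300-token sequence with 128-token blocks occupies 3 blocks;
# the last block has 84 free slots until more tokens are generated.
```

This is why growing a sequence by one token usually costs nothing, but crossing a block boundary allocates a whole new block.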
v0.8.5+Gaudi-1.22.0-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistral & Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
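One entry above (#138) adds a functools.wraps decorator to with_mark_steps. A sketch of why that matters, with the device synchronization call stubbed out (hypothetical body, not the fork's implementation):

```python
import functools

def with_mark_steps(fn):
    """Wrap fn so a graph-break/sync point runs around it.

    Hypothetical sketch: the real decorator would call a mark-step
    primitive before and after fn; stubbed here with comments.
    """
    @functools.wraps(fn)  # preserve fn.__name__/__doc__ for profiler event names
    def wrapped(*args, **kwargs):
        # mark_step() would run here, before the wrapped call
        result = fn(*args, **kwargs)
        # mark_step() again after the wrapped call
        return result
    return wrapped

@with_mark_steps
def forward(x):
    return x * 2
```

Without functools.wraps, every decorated function would report its name as "wrapped", which muddles the high-level profiler events also mentioned in this changelog.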
v0.8.5.post1+Gaudi-1.21.3
What's Changed
- Update requirements-hpu.txt by @michalkuligowski in #1018
- [SW-224648] Redirect test logs to file by @bmyrcha in #1016
- add ScaleToHwAligned for loading fp8 vllm model by @changwangss in #941
- Fix async callback ordering by @madamczyk-intel in #1023
- Implement Pipeline Parallelism support for HPU. by @jmaksymczuk in #1000
- Make lazy mode autodetection more robust by @kzawora-intel in #921
- [SW-224648] Fix test logs redirection by @bmyrcha in #1026
- [CI] Add APC tests by @kzawora-intel in #866
- [SW-225233] Adjust method of getting synapse_build by @bmyrcha in #1044
- Add more testowners by @adobrzyn in #1046
- APC - Remove prompt attn with context and use existing implementation by @adobrzyn in #1020
- Add exponential bucketing integration by @kzawora-intel in #642
- Marketing requested additional details of the ramp-up phase. by @MohitIntel in #1069
- Add in Dockerfile.hpu.ubi by @AnetaKaczynska in #1077
- Synchronize vLLM flags to support cross-node inference by @IT-Forrest in #897
- Set VLLM_T_COMPILE_FULLGRAPH=False in CI multi-modal tests by @afierka-intel in #1042
- Enable APC pre-merge tests to compile test suite by @afierka-intel in #1076
- IG: fix multimodal reshape for Qwen2.5-VL (revert #691) by @imangohari1 in #1081
- Fix embedding model accuracy issue when merged prefill is enabled by @libinta in #1047
- Enable dynamic shape for torch.compile under flag by @anko-intel in #1078
- [SW-225980] Allow to skip pytest for non-code related changes by @bmyrcha in #1092
- Update CODEOWNERS by @mgawarkiewicz-intel in #1107
- fix prepare_cos_sin invoke in RotaryEmbedding by @zhouyu5 in #1035
- multi-image support for llama3.2 [1/N] by @zhouyu5 in #926
- Add t.compile fp8 performance test to jenkins by @bkowalskiINTEL in #1066
- Update run-tests.sh by @michalkuligowski in #1117
- Rebase - 2025.04.06 by @kzawora-intel in #947
- Revert "Rebase - 2025.04.06" by @kzawora-intel in #1128
- Rebase mar 24 again by @michalkuligowski in #1127
- Restore fsdpa calibration by @madamczyk-intel in #1086
- Rebase mar 24 fixed by @michalkuligowski in #1130
- Simplify calling torch.compile by @anko-intel in #1140
- Bump xgrammar from 0.1.11 to 0.1.18 by @dependabot[bot] in #1043
- Update requirements-hpu.txt by @afierka-intel in #1125
- Modify RobertaEmbedding forward as custom op method by @yeonsily in #996
- [TC] Fix to graph break inside set_block_mapping by @jczaja in #1143
- [SW-224668] Fix for LLaMA LoRA test_layers_hpu by @rsshaik1 in #1074
- [SW-224666] Fix for LLaMA LoRA test_lora_manager_hpu by @rsshaik1 in #1070
- Fix profiling collection for VLLM_PT_PROFILE by @mswiniarsk in #1156
- Enable torchrun on Gaudi by @czhu15 in #974
- Minor fix regd. VLLM_GRAPH_PROMPT_RATIO in README_GAUDI.md by @MohitIntel in #1168
- Fix accuracy issue for llama 3.2 vision models. by @libinta in #1176
- add test owner by @jikunshang in #1082
- Add additional devs to TESTOWNERS by @bkowalskiINTEL in #1075
- Update CODEOWNERS by @michalkuligowski in #1185
- [SPEC_DECODE][V0] fix for spec decode eagle after rebase by @xuechendi in #1150
- Fix fixture duplication in async_engine tests by @akarnows in #1180
- Rebase apr 25 by @michalkuligowski in #1166
- [SW-225282] - Handle Batch Dimension for LoRA by @hlahkar in #1182
- Rebase apr 30 by @michalkuligowski in #1190
- Reduce recompilations when using merged_prefill by @madamczyk-intel in #1167
- Update TESTOWNERS by @madamczyk-intel in #1200
- [SW-225635] Adjust logging in CI by @bmyrcha in #1202
- Switch V1 env to False as default by @afierka-intel in #1206
- Update codeowners by @madamczyk-intel in #1217
- Rebase may 06 by @michalkuligowski in #1207
- [V1] Set dynamo cache size even if warmup is skipped by @Kacper-Pietkun in #1173
- Introduce block_softmax_adjustment kernel by @madamczyk-intel in #1174
- add missing transpose in MultiHeadAttention by @zhouyu5 in #1218
- [Spec Decode] Fix MLP speculative failing issue after rebase to Apr 30 by @xuechendi in #1210
- [Deepseek R1][v0] Porting deepseek r1 to habana_main by @xuechendi in #1161
- Set vllm-hpu-extension to 89030c by @madamczyk-intel in #1228
- Set hpu-extension to a060794 by @madamczyk-intel in #1232
- Add VLLM_PROFILE_* flags to V1 by @madamczyk-intel in #1203
- Update Dockerfile.hpu.ubi by @AnetaKaczynska in #1205
- Fix INC Finalization Check by @yiliu30 in #1230
- [CI] Align t.compile and lazy test definitions by @anko-intel in #1157
- [SW-228109][v0] [llama4 ]Llama 4 support for vLLM fork by @leopck in #1235
- fix dummy sequence length setting in llama3.2 by @zhouyu5 in #1229
- Enable Delayed Sampling by default by @mswiniarsk in #937
- [V1] Port t.compile optimizations from V0 to V1 by @Kacper-Pietkun in #1237
- [V1] enable fp8 by @Kacper-Pietkun in #1222
- Switch to V0 by default in envs.py by @kwisniewski98 in #1233
- [SW-228755] Fix CI for v0 spec decode fix by @xuechendi in #1252
- Apply test permission by @zhouyu5 in #1258
- [CI] Align t.compile and lazy tests by @anko-intel in #1250
- [BugFix] Fix --disable-log-stats in V1 server mode vllm-project#17600 by @iboiko-habana in #1249
- [SW-219737][habana_main] Support MTP to deepseek by @xuechendi in #1254
- fix text only input for llama3.2 by @zhouyu5 in #1262
- Remove intel implementation of top-p/top-k sampling method by @afierka-intel in #1243
- [CI] Add benchmark return status by @anko-intel in #1259
- [habana_main]enable padding_aware_scheduler for speculative decoding by @xuechendi in #1264
- Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #1184
- [SW-228365] - Update test cases for Lora by @hlahkar in #1256
- fix embedding crash with torch.compile by @libinta in #1213
- WA for CI - pkg resources by @adobrzyn in #1280
- [SW-228266] Fix LoRA layers test by @hlahkar in #1276
- Skip guards after fully warming up the model by @anko-intel in #1272
- Replace in-place add with out-of-place add in layernorm forward_hpu. by @jmaksymc in #1281
- Add 256 as possible option within block-size arg by @ksmusz in #1279
- Flat KV cache layout by @kdamaszk in #1106
- [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432) by @adobrzyn in https://github.com/HabanaAI/vllm-fork/pull/...
v0.8.5+Gaudi-1.21.2-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistral & Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.2
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.2 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
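For reference, a registered habana runtime typically corresponds to an entry like the following in /etc/docker/daemon.json. This is a sketch based on the standard habanalabs-container-runtime setup; the binary path may differ depending on your installation:

```json
{
  "runtimes": {
    "habana": {
      "path": "/usr/bin/habana-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After editing daemon.json, restart the Docker daemon so the runtime is picked up.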
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
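The image reference above is just the Support Matrix versions slotted into a fixed path. A minimal sketch of parameterizing it, so the pull and run commands stay in sync when you update versions (the version values below are examples for this release, not authoritative):

```shell
# Compose the Gaudi Docker image reference from Support Matrix versions.
# Substitute the versions listed in the Support Matrix for your release.
GAUDI_VER=1.21.2
UBUNTU_VER=ubuntu22.04
PT_VER=2.6.0
IMAGE="vault.habana.ai/gaudi-docker/${GAUDI_VER}/${UBUNTU_VER}/habanalabs/pytorch-installer-${PT_VER}:latest"
echo "$IMAGE"
```

You can then use `$IMAGE` in both the `docker pull` and `docker run` commands.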
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.8.5.post1+Gaudi-1.21.2
$ pip install -r requirements-hpu.txt
$ python setup.py develop
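Note that the release tag encodes both the vLLM base version and the Gaudi software version. A minimal Python sketch of splitting the two (the tag strings are taken from this page; the helper function is hypothetical, not part of vLLM):

```python
def parse_release_tag(tag: str) -> tuple[str, str]:
    """Split a vllm-fork release tag of the form v<vllm_version>+Gaudi-<gaudi_version>."""
    vllm_part, gaudi_part = tag.split("+Gaudi-")
    return vllm_part.lstrip("v"), gaudi_part

print(parse_release_tag("v0.8.5.post1+Gaudi-1.21.2"))  # ('0.8.5.post1', '1.21.2')
```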
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional releas... |
v0.7.2+Gaudi-1.21.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.0 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism using multiprocessing or Ray and HCCL. | Documentation Example HCCL reference |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional release) ... |
v0.6.6.post1+Gaudi-1.20.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.20.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.20.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently multiple ways to install vLLM with Intel® Gaudi®. Pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. Each stable version is released with a tag and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.6.post1+Gaudi-1.20.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
Currently, the latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
| Feature | Description | References |
|---|---|---|
| Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically upon vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node with tensor parallelism using Ray and HCCL. | Documentation Example HCCL reference |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
| Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
| Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-seqs parameter. | Feature RFC |
| Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
| Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | Documentation |
Unsupported Features
- Beam s...