Merged
17 changes: 7 additions & 10 deletions .cd/Dockerfile.rhel.tenc.pytorch.vllm
@@ -1,4 +1,7 @@
# Parameterize base image components for RHEL 8.6/9.2/9.4/9.6 and TencentOS 3.1
# Copyright (C) 2025 Habana Labs, Ltd. an Intel Company
# SPDX-License-Identifier: Apache-2.0

# Parameterize base image components for RHEL 9.4/9.6 and TencentOS 3.1
ARG DOCKER_URL=vault.habana.ai/gaudi-docker
ARG VERSION=1.21.1
ARG BASE_NAME=rhel8.6
@@ -16,23 +19,17 @@ ENV BASE_NAME=${BASE_NAME}

ENV OMPI_MCA_btl_vader_single_copy_mechanism=none

# Install required packages for RHEL 8.6/9.x and TencentOS 3.1
# Install required packages for RHEL 9.x and TencentOS 3.1
RUN if echo "$BASE_NAME" | grep -qi "tencentos"; then \
yum remove -y mpitests_openmpi perftest openmpi opensm-libs || true && \
yum update -y --exclude=openmpi --exclude=opensm-libs && \
yum install -y gettext jq python3-pip git --allowerasing --exclude=openmpi --exclude=opensm-libs && \
ln -sf /usr/bin/python3 /usr/bin/python ; \
elif echo "$BASE_NAME" | grep -q "^rhel8"; then \
yum module reset perl -y && \
yum module enable perl:5.26 -y && \
yum update -y && \
yum install -y gettext jq python3-pip git --allowerasing && \
ln -sf /usr/bin/python3 /usr/bin/python ; \
else \
yum update -y && \
yum install -y gettext jq python3-pip git --allowerasing && \
ln -sf /usr/bin/python3 /usr/bin/python ; \
fi
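
The `RUN` step above keys off `BASE_NAME` to decide whether to hold back the vendored OpenMPI packages on TencentOS. A standalone sketch of that detection logic, with an illustrative `BASE_NAME` value rather than the real build argument, is:

```shell
# Illustrative sketch of the distro detection used in the RUN step above.
# BASE_NAME here is a sample value, not the Docker build argument itself.
BASE_NAME="tencentos3.1"
if echo "$BASE_NAME" | grep -qi "tencentos"; then
    # TencentOS: keep the preinstalled OpenMPI stack out of yum's hands.
    YUM_EXCLUDES="--exclude=openmpi --exclude=opensm-libs"
else
    YUM_EXCLUDES=""
fi
echo "$YUM_EXCLUDES"
```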

WORKDIR /root

50 changes: 0 additions & 50 deletions .cd/Dockerfile.suse.pytorch.vllm

This file was deleted.

10 changes: 6 additions & 4 deletions .cd/README.md
@@ -123,7 +123,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="<docker image url>" \
VTENSOR_PARALLEL_SIZE=1 \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
@@ -147,15 +147,16 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam

```bash
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
DOCKER_IMAGE="<docker image url>" \
VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
```

> [!NOTE]
> When using configuration files, you do not need to set the `MODEL` environment variable, as the model name is specified within the configuration file. However, you must still provide your `HF_TOKEN`.
> When using configuration files, you do not need to set the `MODEL` environment variable, as the model name is specified within the configuration file. However, you must still provide your `HF_TOKEN` and `DOCKER_IMAGE`.

### 7. Running the Server Directly with Docker

@@ -175,6 +176,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-p 8000:8000 \
-e HF_HOME='/mnt/hf_cache' \
--name vllm-server \
<docker image name>
```
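
Once the container is up, a quick smoke test against the OpenAI-compatible endpoint might look like the following. The model name, port, and prompt are assumptions for illustration; substitute whatever the server actually loaded:

```shell
# Hypothetical smoke test; assumes a vLLM server is listening on localhost:8000.
PAYLOAD='{"model": "Qwen/Qwen2.5-14B-Instruct", "prompt": "Hello", "max_tokens": 16}'
curl -s --max-time 5 http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" || echo "server not reachable"
```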
39 changes: 26 additions & 13 deletions README_GAUDI.md
@@ -16,7 +16,10 @@ For setups using an L3 switch, the `gaudinet.json` file must be mapped as descri

- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.0 and above
- Intel Gaudi software version 1.22.0 and above

## Running vLLM on Gaudi with Docker Compose
Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and the Gaudi software stack. Please follow the [instructions](https://github.com/HabanaAI/vllm-fork/tree/v0.9.0.1%2BGaudi-1.22.0/.cd) to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.

## Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
@@ -35,6 +38,10 @@ Runtime" section of [Docker Installation](https://docs.habana.ai/en/latest/Insta
Make sure you have the ``habanalabs-container-runtime`` package installed and that the ``habana`` container runtime is registered.

### Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
> [!NOTE]
> Prerequisite: Starting from the 1.22.x Intel Gaudi software version, the RHEL Docker image must be created manually before running the command. Additionally, the path to the Docker image must be updated in the `Dockerfile.hpu.ubi` file.

```
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
@@ -63,8 +70,8 @@ Refer to the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Instal
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the [Support Matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html):

```{.console}
$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
```

### Build and Install vLLM
@@ -78,7 +85,7 @@ vLLM releases are being performed periodically to align with Intel® Gaudi® sof
```{.console}
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
```
@@ -119,13 +126,13 @@ $ python setup.py develop
| Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism with multiprocessing or Ray and HCCL. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)<br>[Example](https://docs.ray.io/en/master/serve/tutorials/vllm-example.html)<br>[HCCL reference](https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/index.html) |
| Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multi-node with pipeline parallelism. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html)<br> [Running Pipeline Parallelism](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#pipeline-parallelism) |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)<br>[vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes)<br>[Optimization guide](https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#hpu-graph-capture) |
| Inference with torch.compile | vLLM HPU backend supports inference with `torch.compile`. | [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html) |
| Inference with torch.compile | vLLM HPU backend supports inference with `torch.compile`, with full support for FP8 and BF16 precisions. | [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html) |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | [Library](https://github.com/casper-hansen/AutoAWQ) |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | [Library](https://github.com/AutoGPTQ/AutoGPTQ) |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | [Documentation](https://docs.vllm.ai/en/stable/models/lora.html)<br>[Example](https://docs.vllm.ai/en/stable/getting_started/examples/multilora_inference.html)<br>[vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) |
| Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by standard `--num-scheduler-seqs` parameter. | [Feature RFC](https://github.com/vllm-project/vllm/issues/6854) |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by standard `--enable-prefix-caching` parameter. | [Documentation](https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html)<br>[Details](https://docs.vllm.ai/en/stable/automatic_prefix_caching/details.html) |
| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via standard `--speculative_model` and `--num_speculative_tokens` parameters. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/stable/models/spec_decode.html)<br>[Example](https://docs.vllm.ai/en/stable/getting_started/examples/mlpspeculator.html) |
| Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | [Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) |
| Multimodal | vLLM HPU backend supports inference for multi-modal models. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/latest/serving/multimodal_inputs.html) |
@@ -134,9 +141,10 @@
| Guided decode | vLLM HPU supports a guided decoding backend for generating structured outputs. | [Documentation](https://docs.vllm.ai/en/latest/features/structured_outputs.html) |
| Delayed Sampling (experimental) | vLLM HPU supports delayed sampling scheduling for asynchronous execution, enabled by `VLLM_DELAYED_SAMPLING=true` environment variable. | N/A |
| Exponential bucketing | vLLM HPU supports exponential bucketing spacing instead of linear to automate configuration of bucketing mechanism, enabled by default. It can be disabled via `VLLM_EXPONENTIAL_BUCKETING=false` environment variable. | N/A |
| Torchrun offline inference | vLLM HPU backend supports tensor-parallel offline inference with torchrun on Gaudi. | N/A |

> [!NOTE]
> All specified features are also supported with the `-- enforce-eager` flag.
> All specified features are also supported with the `--enforce-eager` flag.

# Unsupported Features

@@ -163,6 +171,8 @@ The following configurations have been validated to function with Gaudi 2 or Gau
| [meta-llama/Llama-3.2-90B-Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision) | 4, 8 (min. for Gaudi 2) | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | 4, 8 (min. for Gaudi 2) | BF16 | Gaudi 2, Gaudi 3 |
| [meta-llama/Meta-Llama-3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B) | 4 | BF16, FP8 | Gaudi 3|
| [meta-llama/Meta-Llama-4-Scout-17B-16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E) | 4, 8 | BF16, FP8 | Gaudi 3|
| [meta-llama/Meta-Llama-4-Maverick-17Bx128E](https://huggingface.co/meta-llama/Llama-4-maverick-17B-128E) | 8 | BF16, FP8 | Gaudi 3|
| [meta-llama/Granite-3B-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) | 1 | BF16 | Gaudi 3|
| [meta-llama/Granite-3.0-8B-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
| [meta-llama/Granite-20B-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | 1 | BF16, FP8 | Gaudi 2, Gaudi 3|
@@ -171,9 +181,12 @@ The following configurations have been validated to function with Gaudi 2 or Gau
| [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 1, 2 | BF16 | Gaudi 2|
| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | 2 | FP8, BF16 |Gaudi 2, Gaudi 3|
| [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 1, 8 | BF16 | Gaudi 2, Gaudi 3 |
| [princeton-nlp/gemma-2-9b-it-SimPO](https://huggingface.co/princeton-nlp/gemma-2-9b-it-SimPO) | 1 | BF16 |Gaudi 2, Gaudi 3|
| [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 | BF16 |Gaudi 2|
| [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 8 | BF16 |Gaudi 2|
| [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 | BF16, FP8 |Gaudi 2, Gaudi 3|
| [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | 8 | BF16, FP8 |Gaudi 2, Gaudi 3|
| [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | 4 | BF16, FP8 |Gaudi 2, Gaudi 3|
| [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) | 4, 8 | BF16, FP8 |Gaudi 2, Gaudi 3|
| [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | 8 | BF16 |Gaudi 3|
| [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | 8 | BF16 |Gaudi 3|
| [meta-llama/CodeLlama-34b-Instruct-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Instruct-hf) | 1 | BF16 |Gaudi 3|
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)<br> [quick start scripts](https://github.com/HabanaAI/vllm-fork/blob/deepseek_r1/scripts/DEEPSEEK_R1_ON_GAUDI.md) | 8 | FP8, BF16 |Gaudi 2, Gaudi 3|

@@ -372,7 +385,7 @@ batch size is often at its maximum, making large-batch HPU graphs critical to ca
> [!TIP]
> When a deployed workload does not utilize the full context that a model can handle, it is good practice to limit the maximum values upfront based on the input and output token lengths that will be generated after serving the vLLM server.
<br><br>**Example:**<br>Let's assume that we want to deploy the text generation model Qwen2.5-1.5B, which has a defined `max_position_embeddings` of 131072 (our `max_model_len`). At the same time, we know that our workload pattern will not use the full context length, because we expect a maximum input token size of 1K and predict generating a maximum of 2K tokens as output. In this case, starting the vLLM server to be ready for the full context length is unnecessary. Instead, we should limit it upfront to achieve faster service preparation and decrease warm-up time. The recommended values in this example are:
> - `--max_model_len`: `3072` - the sum of input and output sequences (1+2)*1024.
> - `--max-model-len`: `3072` - the sum of input and output sequences (1+2)*1024.
> - `VLLM_PROMPT_SEQ_BUCKET_MAX`: `1024` - the maximum input token size that we expect to handle.
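
The arithmetic behind those recommended values can be sketched in a few lines of shell; the token counts mirror the example above and are illustrative:

```shell
# Derive the recommended limits from the expected token budgets.
INPUT_TOK=1024     # expected maximum input length, in tokens
OUTPUT_TOK=2048    # expected maximum generated length, in tokens
MAX_MODEL_LEN=$(( INPUT_TOK + OUTPUT_TOK ))   # (1+2)*1024 = 3072
VLLM_PROMPT_SEQ_BUCKET_MAX=$INPUT_TOK         # cap prompt buckets at the input budget
echo "$MAX_MODEL_LEN $VLLM_PROMPT_SEQ_BUCKET_MAX"
```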

<br/>**Additional Performance Tuning Knobs - Linear Bucketing Strategy only:**
@@ -419,7 +432,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
# Quantization, FP8 Inference and Model Calibration Process

> [!NOTE]
> Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in detail in [docs.habana.ai vLLM Inference Section](https://docs.habana.ai/en/v1.21.0/PyTorch/Inference_on_PyTorch/vLLM_Inference/vLLM_FP8_Inference.html).
> Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in detail in [docs.habana.ai vLLM Inference Section](https://docs.habana.ai/en/latest/PyTorch/vLLM_Inference/vLLM_FP8_Inference.html).
An end-to-end example tutorial for quantizing a BF16 Llama 3.1 model to FP8 and then running inference is provided in this [Gaudi-tutorials repository](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/vLLM_Tutorials/FP8_Quantization_using_INC/FP8_Quantization_using_INC.ipynb).

Once you have completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
5 changes: 4 additions & 1 deletion docker/Dockerfile.hpu
@@ -1,4 +1,7 @@
FROM vault.habana.ai/gaudi-docker/1.20.1/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
# Copyright (C) 2025 Habana Labs, Ltd. an Intel Company
# SPDX-License-Identifier: Apache-2.0

FROM vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest

COPY ./ /workspace/vllm

8 changes: 7 additions & 1 deletion docker/Dockerfile.hpu.ubi
@@ -1,4 +1,10 @@
ARG BASE_IMAGE=vault.habana.ai/gaudi-docker/1.20.0/rhel9.4/habanalabs/pytorch-installer-2.6.0:latest
# Copyright (C) 2025 Habana Labs, Ltd. an Intel Company
# SPDX-License-Identifier: Apache-2.0

## Starting from the 1.22.x Intel Gaudi software version, we no longer provide RHEL Docker images.
## These must be created manually, and the path below should be updated accordingly.
ARG BASE_IMAGE=vault.habana.ai/gaudi-docker/1.21.4/rhel9.4/habanalabs/pytorch-installer-2.6.0:latest

FROM ${BASE_IMAGE} as habana-base

USER root