
[deepseek r1] HPU support for deepseek #1030

Merged
michalkuligowski merged 14 commits into v1.21.0_next from dev/chendi/deepseek_r1 on Apr 15, 2025
Conversation


@xuechendi xuechendi commented Apr 8, 2025

Migrated from a PR to habana_main: #1014

For best performance, this PR is recommended to be run with INC:
[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs: https://jira.habana-labs.com/browse/SW-223553

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}


QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```
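
The blocklist names are regular expressions matched against module names (note the `\\.` and `\\b` escapes in the JSON), which keeps the LM head, the MoE router gate, and the block2batch matmul out of quantization. A quick illustrative check of the matching behavior; the module names below are hypothetical examples, not taken from this PR:

```
import re

# Blocklist patterns from inc_quant_with_fp8kv_config.json (regex form).
patterns = [r"lm_head", r"mlp\.gate\b", r"block2batch_matmul"]

# Hypothetical module names, for illustration only.
names = [
    "model.layers.3.mlp.gate",       # blocked: \b matches at the end of "gate"
    "model.layers.3.mlp.gate_proj",  # quantized: "gate" runs into "_proj", so \b fails
    "lm_head",                       # blocked
]

for name in names:
    blocked = any(re.search(p, name) for p in patterns)
    print(f"{name}: {'blocked from quantization' if blocked else 'quantized'}")
```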

**test acc of G2**:
**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed once INC is updated).

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9492|± |0.0137|
| | |strict-match | 5|exact_match|↑ |0.9453|± |0.0142|
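
As a sanity check (not part of the PR), the reported Stderr values match the usual binomial standard error sqrt(p * (1 - p) / n) for the 256-sample limit:

```
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# exact_match values from the table above, with limit 256
print(round(binomial_se(0.9492, 256), 4))  # 0.0137 (flexible-extract)
print(round(binomial_se(0.9453, 256), 4))  # 0.0142 (strict-match)
```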

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1
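
If you do not have a local checkout of that branch, one way to install it directly (a sketch, not a command from this PR) is:

```bash
pip install git+https://github.com/HabanaAI/vllm-hpu-extension.git@dev/chendi/deepseek_r1
```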

Status:

- runnable with DeepSeek-R1
- accuracy check for block fp8 weight => garbage output
- accuracy check for BF16 weight => looks good

test scripts:

```
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'
os.environ['PT_HPU_WEIGHT_SHARING'] = '0'
#os.environ['HABANA_LOGS'] = "vllm_inc_debug"
#os.environ["LOG_LEVEL_ALL"] = "3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION'] = '1'
#os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
#os.environ["LOGLEVEL"] = "DEBUG"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)

    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"

    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              #kv_cache_dtype="fp8_inc",
              seed=2024)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```
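
A minimal way to launch the script above (the filename is a placeholder of mine; set QUANT_CONFIG only for the INC FP8 path, which also triggers the explicit shutdown at the end):

```bash
# BF16 path
python test_deepseek_r1.py

# INC FP8 path (also uncomment kv_cache_dtype="fp8_inc" in the script)
QUANT_CONFIG=inc_quant_with_fp8kv_config.json python test_deepseek_r1.py
```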


@kwisniewski98 kwisniewski98 left a comment


Please also make sure that pre-commit is passing. You'll also need to switch the HPU extension commit SHA temporarily to be able to run tests.

@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from 0d069b4 to 8d17e10 on April 8, 2025 16:52
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from 8d17e10 to 35f352e on April 8, 2025 17:03
@xuechendi xuechendi requested a review from kwisniewski98 April 8, 2025 17:03
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from c695238 to b147f2e on April 8, 2025 18:43
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from b147f2e to 85f0693 on April 8, 2025 22:03
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@xuechendi xuechendi changed the base branch from v1.21.0 to v1.21.0_next April 11, 2025 21:50
@xuechendi xuechendi self-assigned this Apr 12, 2025
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@afierka-intel

/run-gaudi-tests

@afierka-intel

/run-gaudi-tests

@michalkuligowski

@kwisniewski98 @xuechendi

  1. we are waiting for the last tests requested by @madamczyk-intel
  2. isn't [SW-224465]Update post-process script for deepseek #139 vllm-hpu-extension#143 needed for this to work?

@kwisniewski98

kwisniewski98 commented Apr 15, 2025

> @kwisniewski98 @xuechendi
>
>   1. we are waiting for the last tests requested by @madamczyk-intel
>   2. isn't [SW-224465]Update post-process script for deepseek #139 vllm-hpu-extension#143 needed for this to work?

  1. Mixtral accuracy tested with gsm8k seems to be the same as on the 1.21 branch. Perf seems to be the same as well. Regarding t.compile, it is out of scope.
  2. It was merged

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@madamczyk-intel

/run-gaudi-tests

@madamczyk-intel

/run-gaudi-tests

@michalkuligowski

/run-gaudi-tests

@wpyszka wpyszka enabled auto-merge (squash) April 15, 2025 13:55
@michalkuligowski michalkuligowski merged commit 2edff28 into v1.21.0_next Apr 15, 2025
43 checks passed
xuechendi added a commit that referenced this pull request Apr 24, 2025
xuechendi added a commit that referenced this pull request May 5, 2025
xuechendi added a commit that referenced this pull request May 7, 2025
xuechendi added a commit that referenced this pull request May 8, 2025
JIRA: https://jira.habana-labs.com/browse/SW-227174

Cherry-picked #1030 and fixed conflicts after rebase.
Dependency: HabanaAI/vllm-hpu-extension#161

Verified with the 3 methods below:

1. test with DeepSeek-V2 BF16 weight => Passed
2. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight => Passed
3. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight + INC calibrated per-channel scale => Passed acc check; performance reaches the goal (numbers are in the JIRA ticket)

== Details ==

1. test with DeepSeek-V2 BF16 weight:
```
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32 
```
```
(VllmWorkerProcess pid=1039) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1038) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1041) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.57it/s, est. speed input: 12.59 toks/s, output: 50.37 toks/s]
e2e took 2.5509743690199684 seconds
====================================
Prompt: 'Hello, my name is'
Generated text: '\nI am a 20 year old student from the UK. I am currently studying for a degree in English Literature and Creative Writing at the University of East'
Ground truth: None
====================================
====================================
Prompt: '0.999 compares to 0.9 is '
Generated text: '100%\n0.9999999999999999999999999'
Ground truth: None
====================================
====================================
Prompt: 'The capital of France is'
Generated text: ' Paris, which is also the largest city in the country. The city is located on the Seine River and is known for its beautiful architecture, museums, and art'
Ground truth: None
====================================
====================================
Prompt: 'The future of AI is'
Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'
Ground truth: None
====================================
```
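
run_example_tp.py itself is not part of this diff; a minimal sketch of what such a tensor-parallel example script could look like (the --model/--tokenizer/--osl argument names and the prompts come from the run above; everything else is an assumption):

```
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--tokenizer", required=True)
parser.add_argument("--osl", type=int, default=32)  # output sequence length
args = parser.parse_args()

prompts = [
    "Hello, my name is",
    "0.999 compares to 0.9 is ",
    "The capital of France is",
    "The future of AI is",
]

# Greedy decoding for a fixed number of output tokens.
sampling_params = SamplingParams(temperature=0.0, max_tokens=args.osl, ignore_eos=True)

llm = LLM(model=args.model,
          tokenizer=args.tokenizer,
          trust_remote_code=True,
          dtype="bfloat16",
          tensor_parallel_size=4,  # assumed: the log above shows four ranks
          distributed_executor_backend='mp')

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
print(f"e2e took {time.perf_counter() - start} seconds")
for output in outputs:
    print("=" * 36)
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated text: {output.outputs[0].text!r}")
    print("=" * 36)
```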

2. evaluate acc on deepseek-r1 with out-of-box block fp8 weight - limit 256

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9648|± |0.0115|
| | |strict-match | 5|exact_match|↑ |0.9648|± |0.0115|

3. evaluate acc on deepseek-r1 with out-of-box block fp8 weight + INC calibrated per-channel scale

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9688|± |0.0109|
| | |strict-match | 5|exact_match|↑ |0.9688|± |0.0109|

---------

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: kwisniewski98 <kwisniewski@habana.ai>
Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Yi Liu <yiliu4@habana.ai>