
[deepseek r1] HPU support for deepseek #1030

Merged
michalkuligowski merged 14 commits into v1.21.0_next from dev/chendi/deepseek_r1 on Apr 15, 2025
Conversation


@xuechendi xuechendi commented Apr 8, 2025

Migrated from a PR to habana_main: #1014

For best performance, this PR is recommended to be run with INC:
[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs: https://jira.habana-labs.com/browse/SW-223553

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}


QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```
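
The blocklist names are regular expressions matched against module names (note the `\\.` and `\\b` escapes in the JSON), which keeps the LM head, the MoE router gate, and the block2batch matmul out of quantization. A quick illustrative check of the matching behavior; the module names below are hypothetical examples, not taken from this PR:

```
import re

# Blocklist patterns from inc_quant_with_fp8kv_config.json (regex form).
patterns = [r"lm_head", r"mlp\.gate\b", r"block2batch_matmul"]

# Hypothetical module names, for illustration only.
names = [
    "model.layers.3.mlp.gate",       # blocked: \b matches at the end of "gate"
    "model.layers.3.mlp.gate_proj",  # quantized: "gate" runs into "_proj", so \b fails
    "lm_head",                       # blocked
]

for name in names:
    blocked = any(re.search(p, name) for p in patterns)
    print(f"{name}: {'blocked from quantization' if blocked else 'quantized'}")
```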

**test acc of G2**:
**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed once INC is updated).

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9492|± |0.0137|
| | |strict-match | 5|exact_match|↑ |0.9453|± |0.0142|
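
As a sanity check (not part of the PR), the reported Stderr values match the usual binomial standard error sqrt(p * (1 - p) / n) for the 256-sample limit:

```
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# exact_match values from the table above, with limit 256
print(round(binomial_se(0.9492, 256), 4))  # 0.0137 (flexible-extract)
print(round(binomial_se(0.9453, 256), 4))  # 0.0142 (strict-match)
```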

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1
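
If you do not have a local checkout of that branch, one way to install it directly (a sketch, not a command from this PR) is:

```bash
pip install git+https://github.com/HabanaAI/vllm-hpu-extension.git@dev/chendi/deepseek_r1
```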

Status:

- runnable with DeepSeek-R1
- accuracy check for block fp8 weight => garbage output
- accuracy check for BF16 weight => looks good

test scripts:

```
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'
os.environ['PT_HPU_WEIGHT_SHARING'] = '0'
#os.environ['HABANA_LOGS'] = "vllm_inc_debug"
#os.environ["LOG_LEVEL_ALL"] = "3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION'] = '1'
#os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
#os.environ["LOGLEVEL"] = "DEBUG"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)

    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"

    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              #kv_cache_dtype="fp8_inc",
              seed=2024)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```
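
A minimal way to launch the script above (the filename is a placeholder of mine; set QUANT_CONFIG only for the INC FP8 path, which also triggers the explicit shutdown at the end):

```bash
# BF16 path
python test_deepseek_r1.py

# INC FP8 path (also uncomment kv_cache_dtype="fp8_inc" in the script)
QUANT_CONFIG=inc_quant_with_fp8kv_config.json python test_deepseek_r1.py
```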


@kwisniewski98 kwisniewski98 left a comment


Please also make sure that pre-commit is passing. You'll also need to switch the HPU extension commit SHA temporarily to be able to run tests.

@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from 0d069b4 to 8d17e10 on April 8, 2025 16:52
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from 8d17e10 to 35f352e on April 8, 2025 17:03
@xuechendi xuechendi requested a review from kwisniewski98 April 8, 2025 17:03
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from c695238 to b147f2e on April 8, 2025 18:43
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/chendi/deepseek_r1 branch from b147f2e to 85f0693 on April 8, 2025 22:03
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@xuechendi xuechendi changed the base branch from v1.21.0 to v1.21.0_next April 11, 2025 21:50
@xuechendi xuechendi self-assigned this Apr 12, 2025
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@afierka-intel

/run-gaudi-tests

@afierka-intel

/run-gaudi-tests

@michalkuligowski

@kwisniewski98 @xuechendi

  1. we are waiting for the last tests requested by @madamczyk-intel
  2. isn't [SW-224465]Update post-process script for deepseek #139 vllm-hpu-extension#143 needed for this to work?

@kwisniewski98

kwisniewski98 commented Apr 15, 2025

> @kwisniewski98 @xuechendi
>
>   1. we are waiting for the last tests requested by @madamczyk-intel
>   2. isn't [SW-224465]Update post-process script for deepseek #139 vllm-hpu-extension#143 needed for this to work?

  1. Mixtral accuracy tested with gsm8k seems to be the same as on the 1.21 branch. Perf seems to be the same as well. Regarding t.compile, it is out of scope.
  2. It was merged

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@madamczyk-intel

/run-gaudi-tests

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
@madamczyk-intel

/run-gaudi-tests

@madamczyk-intel

/run-gaudi-tests

@michalkuligowski

/run-gaudi-tests

@wpyszka wpyszka enabled auto-merge (squash) April 15, 2025 13:55
@michalkuligowski michalkuligowski merged commit 2edff28 into v1.21.0_next Apr 15, 2025
43 checks passed
xuechendi added a commit that referenced this pull request Apr 24, 2025
xuechendi added a commit that referenced this pull request May 5, 2025
xuechendi added a commit that referenced this pull request May 7, 2025
xuechendi added a commit that referenced this pull request May 8, 2025
JIRA: https://jira.habana-labs.com/browse/SW-227174

Cherry-picked #1030 and fixed conflicts after rebase.
Dependency: HabanaAI/vllm-hpu-extension#161

Verified with the 3 methods below:

1. test with DeepSeek-V2 BF16 weight => Passed
2. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight => Passed
3. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight + INC calibrated per-channel scale => Passed acc check; performance reaches the goal (numbers are in the JIRA ticket)

== Details ==

1. test with DeepSeek-V2 BF16 weight:
```
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32 
```
```
(VllmWorkerProcess pid=1039) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1038) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1041) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.57it/s, est. speed input: 12.59 toks/s, output: 50.37 toks/s]
e2e took 2.5509743690199684 seconds
====================================
Prompt: 'Hello, my name is'
Generated text: '\nI am a 20 year old student from the UK. I am currently studying for a degree in English Literature and Creative Writing at the University of East'
Ground truth: None
====================================
====================================
Prompt: '0.999 compares to 0.9 is '
Generated text: '100%\n0.9999999999999999999999999'
Ground truth: None
====================================
====================================
Prompt: 'The capital of France is'
Generated text: ' Paris, which is also the largest city in the country. The city is located on the Seine River and is known for its beautiful architecture, museums, and art'
Ground truth: None
====================================
====================================
Prompt: 'The future of AI is'
Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'
Ground truth: None
====================================
```
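
run_example_tp.py itself is not part of this diff; a minimal sketch of what such a tensor-parallel example script could look like (the --model/--tokenizer/--osl argument names and the prompts come from the run above; everything else is an assumption):

```
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--tokenizer", required=True)
parser.add_argument("--osl", type=int, default=32)  # output sequence length
args = parser.parse_args()

prompts = [
    "Hello, my name is",
    "0.999 compares to 0.9 is ",
    "The capital of France is",
    "The future of AI is",
]

# Greedy decoding for a fixed number of output tokens.
sampling_params = SamplingParams(temperature=0.0, max_tokens=args.osl, ignore_eos=True)

llm = LLM(model=args.model,
          tokenizer=args.tokenizer,
          trust_remote_code=True,
          dtype="bfloat16",
          tensor_parallel_size=4,  # assumed: the log above shows four ranks
          distributed_executor_backend='mp')

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
print(f"e2e took {time.perf_counter() - start} seconds")
for output in outputs:
    print("=" * 36)
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated text: {output.outputs[0].text!r}")
    print("=" * 36)
```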

2. evaluate acc on deepseek-r1 with out-of-box block fp8 weight - limit 256

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9648|± |0.0115|
| | |strict-match | 5|exact_match|↑ |0.9648|± |0.0115|

3. evaluate acc on deepseek-r1 with out-of-box block fp8 weight + INC calibrated per-channel scale

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9688|± |0.0109|
| | |strict-match | 5|exact_match|↑ |0.9688|± |0.0109|

---------

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: kwisniewski98 <kwisniewski@habana.ai>
Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Yi Liu <yiliu4@habana.ai>