[Deepseek R1][v0] Porting deepseek r1 to habana_main #1161
xuechendi merged 11 commits into habana_main
Conversation
Force-push: bbb7ff2 to 4163dd0
/run-gaudi-tests
I verified the failed CI locally, and the tests pass. Please help review HabanaAI/vllm-hpu-extension#161 first.

@xuechendi you can update requirements/hpu.txt so the related extension change is utilized. We can't merge it before it is tested in this PR.
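For reference, pinning a cross-repo dependency to the revision under test in `requirements/hpu.txt` typically uses a direct git reference like the line below (the commit placeholder is illustrative, not the actual extension revision):

```
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@<commit-sha>
```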
Force-push: 572a2ff to 187b6fb
/run-gaudi-tests
/skip-gaudi-tests - the two failing multimodal jobs are a false negative; the underlying tests are passing
kwisniewski98
left a comment
There are a lot of changes that I'm not sure are upstreamable; we'd have to ask the maintainers.
Also, I see workarounds that I introduced myself, which I'm also not sure we can bring into main, e.g. MLA not supporting V1 and the issues with torch.compile.
This PR is aiming to cherry-pick the 1.21 DeepSeek support to habana_main. I don't want to push more changes to this PR that would make the implementation very different from the one we merged for 1.21.
/run-gaudi-tests
migrated from a PR to habana_main: #1014

For best performance, this PR is recommended to run with INC: [[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b", "block2batch_matmul"]},
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}

QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
    --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
    --tasks gsm8k --num_fewshot "5" --limit "256" \
    --batch_size "8"
```

**test acc of G2**:

**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed as INC updates.)

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b", "block2batch_matmul"]},
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.9453|±  |0.0142|

----------

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1

Status: runnable with Deepseek-R1.

Accuracy check:
- for block fp8 weight => garbage output
- for BF16 weight => looks good.

test scripts:

```python
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'
os.environ['PT_HPU_WEIGHT_SHARING'] = '0'
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION'] = '1'

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16,
                                     ignore_eos=True)
    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"
    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              # kv_cache_dtype="fp8_inc",
              seed=2024)
    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: kwisniewski98 <kwisniewski@habana.ai>
Use fp32 `gating_output` instead of adding `mark_step()` to fix the accuracy issues in 117555d. This reduces the graph replay duration from ~41 ms to ~32 ms for the decoding phase of the 16k/1k bs=16 benchmark on Gaudi2.
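The idea can be sketched as below: do the router softmax and top-k selection in fp32 so bf16 rounding of near-tied logits cannot flip expert selection. This is a minimal illustration, not the actual HPU kernel code; the function name and shapes are made up.

```python
import torch


def select_experts(gating_output: torch.Tensor, top_k: int):
    """Router top-k with the softmax computed in fp32 (sketch only).

    Upcasting gating_output keeps tiny logit differences from being
    rounded away in bf16, without needing an explicit mark_step().
    """
    probs = torch.softmax(gating_output.float(), dim=-1)  # fp32 softmax
    topk_weights, topk_ids = probs.topk(top_k, dim=-1)
    # Renormalize the selected weights so they sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


logits = torch.randn(4, 8, dtype=torch.bfloat16)  # 4 tokens, 8 experts
weights, ids = select_experts(logits, top_k=2)
print(weights.dtype, tuple(weights.shape), tuple(ids.shape))
```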
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
… multiple cards (#1100)
- Add `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` to speed up the warmup stage.
- Fix the `dist.barrier` issue for single card

cc @xuechendi @thuang6

---------

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
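A single-card `dist.barrier` fix usually amounts to guarding the collective so it is skipped when no multi-rank process group exists. A minimal sketch (the helper name is illustrative, not the code in this PR):

```python
import torch.distributed as dist


def safe_barrier() -> None:
    """Only call dist.barrier() when a process group is actually
    initialized and has more than one rank; otherwise this is a no-op,
    which avoids hanging or erroring on a single card."""
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()


safe_barrier()  # no-op outside a multi-rank run
print("ok")
```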
Previously the code only checked whether a quant_config was in use and then chose VllmMixtureOfExpertsOpFP8 as the OP; the only difference of that OP is that it assumes block quant when measuring scales. This should only happen when Fp8MoEMethod is used as the quant_method. Kwargs in the moe_op call had to be disabled because of the different APIs of the FP8 and unquantized ops.

---------

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
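The dispatch change described above can be sketched as follows: select the FP8 MoE op only when the quant method is actually Fp8MoEMethod, rather than whenever any quant_config exists. The op/method class names mirror those mentioned in the PR, but their bodies and the surrounding plumbing here are stand-ins, not the real vLLM classes:

```python
class Fp8MoEMethod:
    """Stand-in for vLLM's FP8 MoE quant method."""


class UnquantizedMoEMethod:
    """Stand-in for any other (non-FP8) quant method."""


class VllmMixtureOfExpertsOp:
    name = "default"


class VllmMixtureOfExpertsOpFP8:
    # Only difference: scale measurement assumes block quantization.
    name = "fp8"


def select_moe_op(quant_method):
    """Pick the FP8 op only for Fp8MoEMethod, not for any quant_config."""
    if isinstance(quant_method, Fp8MoEMethod):
        return VllmMixtureOfExpertsOpFP8()
    return VllmMixtureOfExpertsOp()


print(select_moe_op(Fp8MoEMethod()).name)          # fp8
print(select_moe_op(UnquantizedMoEMethod()).name)  # default
```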
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Force-push: a808b51 to 27e267c
/run-gaudi-tests
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
/run-gaudi-tests

JIRA: https://jira.habana-labs.com/browse/SW-227174
Cherry-picked #1030 and fixed conflicts after rebase
Dependency:
- HabanaAI/vllm-hpu-extension#161
- HabanaAI/vllm-hpu-extension#170
Verified with the 3 methods below:
== Details ==
run calibration
```json
{
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "whitelist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b"]},
    "quantize_weight": false,
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

run test