[Deepseek R1][v0] Porting deepseek r1 to habana_main #1161
xuechendi merged 11 commits into habana_main
Conversation
Force-push: bbb7ff2 to 4163dd0
/run-gaudi-tests
I verified the failed CI locally, and the tests pass. Please help review HabanaAI/vllm-hpu-extension#161 first.

@xuechendi you can update requirements/hpu.txt so the related extension change is utilized. We can't merge it before it is tested in this PR.
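For reference, pinning a cross-repo dependency to the revision under test in `requirements/hpu.txt` typically uses a direct git reference like the line below (the commit placeholder is illustrative, not the actual extension revision):

```
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@<commit-sha>
```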
Force-push: 572a2ff to 187b6fb
/run-gaudi-tests
/skip-gaudi-tests - the two failing multimodal jobs are a false negative; the underlying tests are passing
kwisniewski98
left a comment
There are a lot of changes that I'm not sure are upstreamable; we'd have to ask the maintainers.
Also, I see workarounds that I introduced myself, which I'm also not sure we can bring into main, e.g. MLA not supporting V1 and the issues with torch.compile.
This PR is aiming to cherry-pick the 1.21 DeepSeek support to habana_main. I don't want to push more changes to this PR that would make the implementation very different from the one we merged for 1.21.
/run-gaudi-tests
migrated from a PR to habana_main: #1014

For best performance, this PR is recommended to run with INC: [[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b", "block2batch_matmul"]},
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}

QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
    --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
    --tasks gsm8k --num_fewshot "5" --limit "256" \
    --batch_size "8"
```

**test acc of G2**:

**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed as INC updates.)

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b", "block2batch_matmul"]},
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.9453|±  |0.0142|

----------

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1

Status: runnable with Deepseek-R1.

Accuracy check:
- for block fp8 weight => garbage output
- for BF16 weight => looks good.

test scripts:

```python
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'
os.environ['PT_HPU_WEIGHT_SHARING'] = '0'
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION'] = '1'

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16,
                                     ignore_eos=True)
    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"
    llm = LLM(model=model_path,
              trust_remote_code=True,
              enforce_eager=True,
              dtype="bfloat16",
              use_v2_block_manager=True,
              max_model_len=1024,
              max_num_seqs=1,
              tensor_parallel_size=8,
              distributed_executor_backend='mp',
              gpu_memory_utilization=0.8,
              # kv_cache_dtype="fp8_inc",
              seed=2024)
    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: kwisniewski98 <kwisniewski@habana.ai>
Use fp32 `gating_output` instead of adding `mark_step()` to fix the accuracy issues in 117555d. This reduces the graph replay duration from ~41 ms to ~32 ms for the decoding phase of the 16k/1k bs=16 benchmark on Gaudi2.
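The idea can be sketched as below: do the router softmax and top-k selection in fp32 so bf16 rounding of near-tied logits cannot flip expert selection. This is a minimal illustration, not the actual HPU kernel code; the function name and shapes are made up.

```python
import torch


def select_experts(gating_output: torch.Tensor, top_k: int):
    """Router top-k with the softmax computed in fp32 (sketch only).

    Upcasting gating_output keeps tiny logit differences from being
    rounded away in bf16, without needing an explicit mark_step().
    """
    probs = torch.softmax(gating_output.float(), dim=-1)  # fp32 softmax
    topk_weights, topk_ids = probs.topk(top_k, dim=-1)
    # Renormalize the selected weights so they sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


logits = torch.randn(4, 8, dtype=torch.bfloat16)  # 4 tokens, 8 experts
weights, ids = select_experts(logits, top_k=2)
print(weights.dtype, tuple(weights.shape), tuple(ids.shape))
```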
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
… multiple cards (#1100)
- Add `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` to speed up the warmup stage.
- Fix the `dist.barrier` issue for single card

cc @xuechendi @thuang6

---------

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
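A single-card `dist.barrier` fix usually amounts to guarding the collective so it is skipped when no multi-rank process group exists. A minimal sketch (the helper name is illustrative, not the code in this PR):

```python
import torch.distributed as dist


def safe_barrier() -> None:
    """Only call dist.barrier() when a process group is actually
    initialized and has more than one rank; otherwise this is a no-op,
    which avoids hanging or erroring on a single card."""
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()


safe_barrier()  # no-op outside a multi-rank run
print("ok")
```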
Previously the code only checked whether a quant_config was in use and then chose VllmMixtureOfExpertsOpFP8 as the OP; the only difference of that OP is that it assumes block quant when measuring scales. This should only happen when Fp8MoEMethod is used as the quant_method. Kwargs in the moe_op call had to be disabled because of the different APIs of the FP8 and unquantized ops.

---------

Signed-off-by: kwisniewski98 <kwisniewski@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
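The dispatch change described above can be sketched as follows: select the FP8 MoE op only when the quant method is actually Fp8MoEMethod, rather than whenever any quant_config exists. The op/method class names mirror those mentioned in the PR, but their bodies and the surrounding plumbing here are stand-ins, not the real vLLM classes:

```python
class Fp8MoEMethod:
    """Stand-in for vLLM's FP8 MoE quant method."""


class UnquantizedMoEMethod:
    """Stand-in for any other (non-FP8) quant method."""


class VllmMixtureOfExpertsOp:
    name = "default"


class VllmMixtureOfExpertsOpFP8:
    # Only difference: scale measurement assumes block quantization.
    name = "fp8"


def select_moe_op(quant_method):
    """Pick the FP8 op only for Fp8MoEMethod, not for any quant_config."""
    if isinstance(quant_method, Fp8MoEMethod):
        return VllmMixtureOfExpertsOpFP8()
    return VllmMixtureOfExpertsOp()


print(select_moe_op(Fp8MoEMethod()).name)          # fp8
print(select_moe_op(UnquantizedMoEMethod()).name)  # default
```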
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Force-push: a808b51 to 27e267c
/run-gaudi-tests
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
/run-gaudi-tests

JIRA: https://jira.habana-labs.com/browse/SW-227174
Cherry-picked #1030 and fixed conflicts after rebase
Dependency:
- HabanaAI/vllm-hpu-extension#161
- HabanaAI/vllm-hpu-extension#170
Verified with the 3 methods below:
== Details ==
run calibration
```json
{
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "whitelist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": ["lm_head", "mlp\\.gate\\b"]},
    "quantize_weight": false,
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

run test