sync attention, deepseek doc #14335
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention)
|---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
| **FlashInfer**              | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| **FA3 (FlashAttention 3)**  | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **FA4 (FlashAttention 4)**  | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **FA4 (FlashAttention 4)**  | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Triton**                  | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Torch Native (SDPA)**     | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| **FlexAttention (PyTorch)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
@@ -38,17 +38,13 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention)
| **TRTLLM MLA (Blackwell)** | 32 or 64 | ✅ | ✅ | ✅ | ❌ |
| **FA3 (FlashAttention 3)** | n/a | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
| **Triton**                 | n/a | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
| **FA4**                    | 128 | ❌ | ❌ | ❌ | ❌ |
| **FA4**                    | 1 | ❌ | ❌ | ❌ | ❌ |

| **FA4** | 1 | ❌ | ❌ | ❌ | ❌ |
| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |

(its actually like this.)

Fridge003 marked this conversation as resolved.
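For context, a specific backend from the matrices above is selected through the `--attention-backend` server argument (the same flag referenced elsewhere in these docs for `trtllm_mla` and `flashinfer`). A minimal sketch, assuming the FA3 backend and a placeholder model path:

```bash
# Hedged sketch: explicitly pick the FA3 backend at launch.
# The model path and --tp value are placeholders, not part of this PR.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --attention-backend fa3 \
    --trust-remote-code \
    --tp 8
```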
@@ -11,20 +11,23 @@ To run DeepSeek V3.1/V3/R1 models, the recommended settings are as follows:

| Weight Type | Configuration |
|------------|-------------------|
| **Full precision FP8**<br>*(recommended)* | 8 x H200 |
| **Full precision [FP8](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)**<br>*(recommended)* | 8 x H200 |
| | 8 x MI300X |
| | 2 x 8 x H100/800/20 |
| | Xeon 6980P CPU |
| **Full precision BF16** | 2 x 8 x H200 |
| **Full precision ([BF16](https://huggingface.co/unsloth/DeepSeek-R1-0528-BF16))** (upcast from original FP8) | 2 x 8 x H200 |
| | 2 x 8 x MI300X |
| | 4 x 8 x H100/800/20 |
| | 4 x 8 x A100/A800 |
| **Quantized weights (AWQ)** | 8 x H100/800/20 |
| | 8 x A100/A800 |
| **Quantized weights (int8)** | 16 x A100/800 |
| **Quantized weights ([INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8))** | 16 x A100/800 |
| | 32 x L40S |
| | Xeon 6980P CPU |
| | 2 x Atlas 800I A3 |
| **Quantized weights ([W4A8](https://huggingface.co/novita/Deepseek-R1-0528-W4AFP8))** | 8 x H20/100, 4 x H200 |
| **Quantized weights ([AWQ](https://huggingface.co/QuixiAI/DeepSeek-R1-0528-AWQ))** | 8 x H100/800/20 |
| | 8 x A100/A800 |
| **Quantized weights ([MXFP4](https://huggingface.co/amd/DeepSeek-R1-MXFP4-Preview))** | 8, 4 x MI355X/350X |
Collaborator (Author): I have personally tried the W4A8 + MXFP4 combinations, so they work fine.
| **Quantized weights ([NVFP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4))** | 8, 4 x B200 |
<style>
.md-typeset__table {

@@ -55,29 +58,38 @@ To run DeepSeek V3.1/V3/R1 models, the recommended settings are as follows:
}
</style>
```{important}
The official DeepSeek V3 is already in FP8 format, so you should not run it with any quantization arguments like `--quantization fp8`.
```

Detailed commands for reference:

- [8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended)
- [8 x MI300X](../platforms/amd_gpu.md#running-deepseek-v3)
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [16 x A100 (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
- [32 x L40S (INT8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
- [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
- [2 x Atlas 800I A3 (int8)](../platforms/ascend_npu.md#running-deepseek-v3)
- [2 x Atlas 800I A3 (INT8)](../platforms/ascend_npu.md#running-deepseek-v3)
### Download Weights
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to the [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.

### Launch with one node of 8 x H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#installation--launch).
**Note that DeepSeek V3 is already in FP8**, so we should not run it with any quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.
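For orientation only, a minimal single-node 8 x H200 launch is sketched below; the linked example is authoritative, and the model path and port here are assumptions.

```bash
# Sketch of a single-node 8 x H200 launch (see the linked benchmark/deepseek_v3 guide for the full command).
# No quantization flags are passed because the checkpoint is already FP8.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --port 30000
```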

### Running examples on Multi-node
### Running examples on Multi-Node
- [Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP](https://lmsys.org/blog/2025-06-16-gb200-part-1/) ([Part I](https://lmsys.org/blog/2025-06-16-gb200-part-1/), [Part II](https://lmsys.org/blog/2025-09-25-gb200-part-2/)) - Comprehensive guide on GB200 optimizations.

- [Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs](https://lmsys.org/blog/2025-05-05-deepseek-pd-ep/) - Guide on PD disaggregation and large-scale EP.

- [Serving with two H20*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes).

- [Best Practices for Serving DeepSeek-R1 on H20](https://lmsys.org/blog/2025-09-26-sglang-ant-group/) - Comprehensive guide on H20 optimizations, deployment and performance.

- [Serving with two H200*8 nodes and docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker).

- [Serving with four A100*8 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes).
@@ -104,7 +116,7 @@ Overall, with these optimizations, we have achieved up to **7x** acceleration in
  <img src="https://lmsys.org/images/blog/sglang_v0_3/deepseek_mla.svg" alt="Multi-head Latent Attention for DeepSeek Series Models">
</p>

**Usage**: MLA optimization is enabled by default. For MLA models on Blackwell architecture (e.g., B200), the default backend is FlashInfer. To use the optimized TRTLLM MLA backend for prefill and decode operations, explicitly specify `--attention-backend trtllm_mla`.

**Usage**: MLA optimization is enabled by default.

**Reference**: Check [Blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [Slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_deepseek_mla.pdf) for more details.
@@ -123,12 +135,16 @@ With data parallelism attention enabled, we have achieved up to **1.9x** decoding
</p>

**Usage**:
- Append `--enable-dp-attention --tp 8 --dp 8` to the server arguments when using 8 H200 GPUs. This optimization improves peak throughput in high batch size scenarios where the server is limited by KV cache capacity. However, it is not recommended for low-latency, small-batch use cases.
- Append `--enable-dp-attention --tp 8 --dp 8` to the server arguments when using 8 H200 GPUs. This optimization improves peak throughput in high batch size scenarios where the server is limited by KV cache capacity.
- DP and TP attention can be flexibly combined. For example, to deploy DeepSeek-V3/R1 on 2 nodes with 8 H100 GPUs each, you can specify `--enable-dp-attention --tp 16 --dp 2`. This configuration runs attention with 2 DP groups, each containing 8 TP GPUs.

```{caution}
Data parallelism attention is not recommended for low-latency, small-batch use cases. It is optimized for high-throughput scenarios with large batch sizes.
```
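Combining the flags from the usage notes above, a hypothetical single-node DP-attention launch might look like the sketch below; the model path is a placeholder.

```bash
# Hedged sketch: DP attention on one 8 x H200 node, using the flags described above.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --enable-dp-attention \
    --tp 8 \
    --dp 8 \
    --trust-remote-code
```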

**Reference**: Check [Blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models).
### Multi Node Tensor Parallelism
### Multi-Node Tensor Parallelism

**Description**: For users with limited memory on a single node, SGLang supports serving DeepSeek Series Models, including DeepSeek V3, across multiple nodes using tensor parallelism. This approach partitions the model parameters across multiple GPUs or nodes to handle models that are too large for one node's memory.
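As a rough sketch of this setup (the linked multi-node examples are authoritative), a two-node TP-16 deployment runs the same command on each node with a shared distributed init address; the IP, port, and model path below are assumptions.

```bash
# Hedged sketch: node 0 of a 2-node TP-16 deployment (assume node 0's IP is 10.0.0.1).
# Run the same command on node 1 with --node-rank 1.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 \
    --dist-init-addr 10.0.0.1:5000 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code
```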
@@ -144,34 +160,36 @@ With data parallelism attention enabled, we have achieved up to **1.9x** decoding

- **DeepGEMM**: The [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) kernel library optimized for FP8 matrix multiplications.

**Usage**: The activation and weight optimization above are turned on by default for DeepSeek V3 models. DeepGEMM is enabled by default on NVIDIA Hopper GPUs and disabled by default on other devices. DeepGEMM can also be manually turned off by setting the environment variable `SGLANG_ENABLE_JIT_DEEPGEMM=0`.
**Usage**: The activation and weight optimization above are turned on by default for DeepSeek V3 models. DeepGEMM is enabled by default on NVIDIA Hopper/Blackwell GPUs and disabled by default on other devices. DeepGEMM can also be manually turned off by setting the environment variable `SGLANG_ENABLE_JIT_DEEPGEMM=0`.

```{tip}
Before serving the DeepSeek model, precompile the DeepGEMM kernels to improve first-run performance. The precompilation process typically takes around 10 minutes to complete.
```

Before serving the DeepSeek model, precompile the DeepGEMM kernels using:
```bash
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```
The precompilation process typically takes around 10 minutes to complete.
### Multi-token Prediction
**Description**: SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) based on [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved by **1.8x** for batch size 1 and **1.5x** for batch size 32 respectively on the H200 TP8 setting.

**Usage**:
Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
Add `--speculative-algorithm EAGLE`. Other flags, like `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens`, are optional. For example:
```
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3-0324 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2 \
    --trust-remote-code \
    --tp 8
```
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
- FlashAttention3, FlashMLA, and the Triton backend fully support MTP usage. For the FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding, the `--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends is still under development.
- To enable DeepSeek MTP for large batch sizes (>32), some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
  - Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value.
  - Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
- The default configuration for DeepSeek models is `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`. The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
- Most MLA attention backends fully support MTP usage. See [MLA Backends](../advanced_features/attention_backend.md#mla-backends) for details.

```{note}
To enable DeepSeek MTP for large batch sizes (>48), you need to adjust some parameters (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
- Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value.
- Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The [default captured batch sizes for speculative decoding](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L888-L895) is 48. You can customize this by including more batch sizes.
```
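To make the note above concrete, a hedged sketch of a large-batch MTP launch is shown below; the specific batch sizes and request cap are illustrative assumptions, not tuned recommendations.

```bash
# Hedged sketch: raise the running-request cap and capture extra CUDA graph batch sizes for large-batch MTP.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3-0324 \
    --speculative-algorithm EAGLE \
    --max-running-requests 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
    --trust-remote-code \
    --tp 8
```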
### Reasoning Content for DeepSeek R1 & V3.1

@@ -230,9 +248,11 @@ The client needs to concatenate all arguments fragments to reconstruct the complete
```
{"city": "Qingdao"}
```
Important Notes:

```{important}
1. Use a lower `"temperature"` value for better results.
2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt.
```
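To illustrate the second point, a hedged launch sketch that applies the recommended chat template is shown below; `--tool-call-parser deepseekv3` is an assumption about the matching parser name and may differ in your SGLang version. Keep the request `"temperature"` low on the client side, as noted in point 1.

```bash
# Hedged sketch: serve DeepSeek with the recommended tool-call chat template.
# --tool-call-parser deepseekv3 is an assumption; check `python3 -m sglang.launch_server --help` for the exact option.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3-0324 \
    --chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja \
    --tool-call-parser deepseekv3 \
    --trust-remote-code \
    --tp 8
```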

### Thinking Budget for DeepSeek R1