Background
Demand for deploying DeepSeek V3.2 and DeepSeek V3.2 Speciale has been growing since their release. However, there are still functionality and performance gaps between DeepSeek V3.2 and DeepSeek V3.1.
Optimization Items
Parallelism
Kernel/Algorithm Optimization (Prior roadmap #11989)
MTP
Others
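MTP on DeepSeek models is driven through SGLang's speculative-decoding flags. A hedged example launch follows; the flag names are from SGLang's speculative-decoding interface, but the specific values are illustrative and not necessarily those used for the benchmark runs below.

```shell
# Illustrative MTP (NextN speculative decoding) launch for DeepSeek V3.2.
# The numeric values are example settings, not the benchmarked configuration.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --trust-remote-code \
  --tp-size 8 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 4
```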
Related resources
Profiling command example
```shell
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3

# bs1
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile

# bs32
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 32 --num-prompts 32 --profile
```
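The profiler writes Chrome-trace JSON files into `SGLANG_TORCH_PROFILER_DIR`, which can be opened in Perfetto or post-processed directly. A minimal sketch of aggregating kernel time from such a trace; the inline `traceEvents` below are a simplified, hypothetical example (real exported traces carry the same `name`/`cat`/`dur` fields, with `dur` in microseconds):

```python
import json
from collections import defaultdict

# Simplified, hypothetical Chrome-trace events standing in for a real
# torch.profiler export loaded via json.load(open(path)).
trace = {"traceEvents": [
    {"name": "gemm_kernel", "cat": "kernel", "dur": 1200},
    {"name": "attention_kernel", "cat": "kernel", "dur": 800},
    {"name": "gemm_kernel", "cat": "kernel", "dur": 1300},
]}

# Sum duration per kernel name to find the hottest kernels.
total_us = defaultdict(int)
for ev in trace["traceEvents"]:
    if ev.get("cat") == "kernel":
        total_us[ev["name"]] += ev["dur"]

for name, us in sorted(total_us.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {us} us")  # gemm_kernel: 2500 us, attention_kernel: 800 us
```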
Long context performance
Here is the performance comparison between DeepSeek V3.1 and V3.2 on long context lengths. (Collected by @XucSh)
From the figure, we can see that DeepSeek V3.2 shows an advantage at long context lengths such as 32k, and pipeline parallelism (PP) helps significantly with latency.
Benchmark data (Updated on 12/11)
Here is the nightly performance data collected on 01/08 (8*B200, link):
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 10.41 | 7003.29 | 52.13 | n/a | 19.18 | 0.11 | 10.66 | trace | trace |
| 8 | 4096 | 11.20 | 32024.01 | 402.37 | n/a | 19.88 | 0.02 | 1.38 | trace | trace |
| 16 | 4096 | 14.49 | 42795.90 | 632.33 | n/a | 25.30 | 0.02 | 0.88 | trace | trace |
| 64 | 4096 | 19.70 | 43666.88 | 2392.64 | n/a | 26.75 | 0.02 | 0.23 | trace | trace |
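The output-cost column is a simple rate-over-throughput conversion. A minimal sketch follows; the $2/hr all-in cluster rate is an assumption for illustration (the benchmark does not state the rate used), chosen because it reproduces the output-cost column above:

```python
def cost_per_million_tokens(tok_per_s: float, cluster_rate_per_hr: float) -> float:
    """Dollar cost to produce 1M tokens at a sustained throughput (tok/s),
    given an hourly cluster rate. Rate is an assumed, illustrative value."""
    seconds_per_million = 1e6 / tok_per_s
    return cluster_rate_per_hr * seconds_per_million / 3600

# With a hypothetical $2/hr rate, this matches the DP8 output-cost column:
print(round(cost_per_million_tokens(52.13, 2.0), 2))   # bs=1  -> 10.66
print(round(cost_per_million_tokens(402.37, 2.0), 2))  # bs=8  -> 1.38
```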
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 5.98 | 10752.06 | 91.40 | n/a | 10.94 | 0.07 | 6.08 | trace | trace |
| 8 | 4096 | 7.44 | 40049.51 | 618.72 | 2.67 | 12.93 | 0.02 | 0.90 | trace | trace |
| 16 | 4096 | 7.49 | 41491.37 | 1386.14 | 3.09 | 11.54 | 0.02 | 0.40 | trace | trace |
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 7.23 | 4634.57 | 80.62 | n/a | 12.40 | 0.17 | 6.89 | trace | trace |
| 8 | 4096 | 10.20 | 15694.80 | 504.66 | n/a | 15.85 | 0.05 | 1.10 | trace | trace |
| 16 | 4096 | 13.44 | 15967.91 | 877.79 | n/a | 18.23 | 0.05 | 0.63 | trace | trace |
| 64 | 4096 | 32.10 | 16129.85 | 2067.71 | n/a | 30.95 | 0.05 | 0.27 | trace | trace |
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 4.36 | 9093.46 | 130.92 | 3.07 | 7.64 | 0.09 | 4.24 | trace | trace |
| 8 | 4096 | 7.83 | 15420.75 | 717.38 | 3.0 | 11.15 | 0.05 | 0.77 | trace | trace |
| 16 | 4096 | 10.23 | 15590.51 | 1359.11 | 3.27 | 11.77 | 0.05 | 0.41 | trace | trace |
deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 4.16 | 22590.03 | 128.60 | n/a | 7.78 | 0.04 | 4.32 | trace | trace |
| 8 | 4096 | 6.58 | 32987.73 | 733.29 | n/a | 10.91 | 0.02 | 0.76 | trace | trace |
| 16 | 4096 | 8.92 | 40294.88 | 1123.31 | n/a | 14.24 | 0.02 | 0.49 | trace | trace |
| 64 | 4096 | 18.83 | 41034.27 | 2632.78 | n/a | 24.31 | 0.02 | 0.21 | trace | trace |
deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 2.27 | 26147.32 | 241.75 | 3.05 | 4.14 | 0.03 | 2.30 | trace | trace |
| 8 | 4096 | 5.78 | 37656.27 | 834.55 | 2.64 | 9.59 | 0.02 | 0.67 | trace | trace |
| 16 | 4096 | 8.00 | 38651.86 | 1299.46 | 2.56 | 12.31 | 0.02 | 0.43 | trace | trace |
These results show substantial room for performance improvement in the DeepSeek V3.2 model family. In a common low-latency case (bs=1, isl=4k), V3.2 with TP8 (~80 tok/s) is ~37% slower than V3.1 with TP8 (~128 tok/s).
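The ~37% figure follows directly from the bs=1, isl=4096 output-throughput entries in the two TP8 tables above:

```python
# Relative decode-throughput gap between V3.1 and V3.2, TP8, bs=1, isl=4096,
# using the output-throughput numbers from the benchmark tables above.
v31_tp8_tps = 128.60  # DeepSeek V3.1, TP8, bs=1 output throughput (tok/s)
v32_tp8_tps = 80.62   # DeepSeek V3.2, TP8, bs=1 output throughput (tok/s)

gap = (v31_tp8_tps - v32_tp8_tps) / v31_tp8_tps
print(f"V3.2 is {gap:.1%} slower than V3.1 at bs=1")  # -> 37.3%
```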