Background
Demand for deploying DeepSeek V3.2 and DeepSeek V3.2 Speciale has been growing since their release. However, there are still functionality and performance gaps between DeepSeek V3.2 and DeepSeek V3.1.
Optimization Items
Parallelism
Kernel/Algorithm Optimization (Prior roadmap #11989)
MTP
Others
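MTP on DeepSeek models is driven through SGLang's speculative-decoding flags. A hedged example launch follows; the flag names are from SGLang's speculative-decoding interface, but the specific values are illustrative and not necessarily those used for the benchmark runs below.

```shell
# Illustrative MTP (NextN speculative decoding) launch for DeepSeek V3.2.
# The numeric values are example settings, not the benchmarked configuration.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --trust-remote-code \
  --tp-size 8 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 4
```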
Related resources
Profiling command example
```shell
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3

# bs1
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile

# bs32
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 32 --num-prompts 32 --profile
```
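The profiler writes Chrome-trace JSON files into `SGLANG_TORCH_PROFILER_DIR`, which can be opened in Perfetto or post-processed directly. A minimal sketch of aggregating kernel time from such a trace; the inline `traceEvents` below are a simplified, hypothetical example (real exported traces carry the same `name`/`cat`/`dur` fields, with `dur` in microseconds):

```python
import json
from collections import defaultdict

# Simplified, hypothetical Chrome-trace events standing in for a real
# torch.profiler export loaded via json.load(open(path)).
trace = {"traceEvents": [
    {"name": "gemm_kernel", "cat": "kernel", "dur": 1200},
    {"name": "attention_kernel", "cat": "kernel", "dur": 800},
    {"name": "gemm_kernel", "cat": "kernel", "dur": 1300},
]}

# Sum duration per kernel name to find the hottest kernels.
total_us = defaultdict(int)
for ev in trace["traceEvents"]:
    if ev.get("cat") == "kernel":
        total_us[ev["name"]] += ev["dur"]

for name, us in sorted(total_us.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {us} us")  # gemm_kernel: 2500 us, attention_kernel: 800 us
```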
Long context performance
Here is the performance comparison between DeepSeek V3.1 and V3.2 on long context lengths. (Collected by @XucSh)
From the figure, we can see that DeepSeek V3.2 shows an advantage at long context lengths such as 32k, and pipeline parallelism (PP) helps significantly with latency.
Benchmark data (Updated on 12/11)
Here is the nightly performance data collected on 01/08 (8*B200, link):
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 10.41 | 7003.29 | 52.13 | n/a | 19.18 | 0.11 | 10.66 | trace | trace |
| 8 | 4096 | 11.20 | 32024.01 | 402.37 | n/a | 19.88 | 0.02 | 1.38 | trace | trace |
| 16 | 4096 | 14.49 | 42795.90 | 632.33 | n/a | 25.30 | 0.02 | 0.88 | trace | trace |
| 64 | 4096 | 19.70 | 43666.88 | 2392.64 | n/a | 26.75 | 0.02 | 0.23 | trace | trace |
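The output-cost column is a simple rate-over-throughput conversion. A minimal sketch follows; the $2/hr all-in cluster rate is an assumption for illustration (the benchmark does not state the rate used), chosen because it reproduces the output-cost column above:

```python
def cost_per_million_tokens(tok_per_s: float, cluster_rate_per_hr: float) -> float:
    """Dollar cost to produce 1M tokens at a sustained throughput (tok/s),
    given an hourly cluster rate. Rate is an assumed, illustrative value."""
    seconds_per_million = 1e6 / tok_per_s
    return cluster_rate_per_hr * seconds_per_million / 3600

# With a hypothetical $2/hr rate, this matches the DP8 output-cost column:
print(round(cost_per_million_tokens(52.13, 2.0), 2))   # bs=1  -> 10.66
print(round(cost_per_million_tokens(402.37, 2.0), 2))  # bs=8  -> 1.38
```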
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 5.98 | 10752.06 | 91.40 | n/a | 10.94 | 0.07 | 6.08 | trace | trace |
| 8 | 4096 | 7.44 | 40049.51 | 618.72 | 2.67 | 12.93 | 0.02 | 0.90 | trace | trace |
| 16 | 4096 | 7.49 | 41491.37 | 1386.14 | 3.09 | 11.54 | 0.02 | 0.40 | trace | trace |
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 7.23 | 4634.57 | 80.62 | n/a | 12.40 | 0.17 | 6.89 | trace | trace |
| 8 | 4096 | 10.20 | 15694.80 | 504.66 | n/a | 15.85 | 0.05 | 1.10 | trace | trace |
| 16 | 4096 | 13.44 | 15967.91 | 877.79 | n/a | 18.23 | 0.05 | 0.63 | trace | trace |
| 64 | 4096 | 32.10 | 16129.85 | 2067.71 | n/a | 30.95 | 0.05 | 0.27 | trace | trace |
deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 4.36 | 9093.46 | 130.92 | 3.07 | 7.64 | 0.09 | 4.24 | trace | trace |
| 8 | 4096 | 7.83 | 15420.75 | 717.38 | 3.0 | 11.15 | 0.05 | 0.77 | trace | trace |
| 16 | 4096 | 10.23 | 15590.51 | 1359.11 | 3.27 | 11.77 | 0.05 | 0.41 | trace | trace |
deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 4.16 | 22590.03 | 128.60 | n/a | 7.78 | 0.04 | 4.32 | trace | trace |
| 8 | 4096 | 6.58 | 32987.73 | 733.29 | n/a | 10.91 | 0.02 | 0.76 | trace | trace |
| 16 | 4096 | 8.92 | 40294.88 | 1123.31 | n/a | 14.24 | 0.02 | 0.49 | trace | trace |
| 64 | 4096 | 18.83 | 41034.27 | 2632.78 | n/a | 24.31 | 0.02 | 0.21 | trace | trace |
deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8+MTP)
| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4096 | 2.27 | 26147.32 | 241.75 | 3.05 | 4.14 | 0.03 | 2.30 | trace | trace |
| 8 | 4096 | 5.78 | 37656.27 | 834.55 | 2.64 | 9.59 | 0.02 | 0.67 | trace | trace |
| 16 | 4096 | 8.00 | 38651.86 | 1299.46 | 2.56 | 12.31 | 0.02 | 0.43 | trace | trace |
These results show substantial room for performance improvement in the DeepSeek V3.2 model family. In a common low-latency case (bs=1, isl=4k), V3.2 with TP8 (~80 tok/s) is ~37% slower than V3.1 with TP8 (~128 tok/s).
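The ~37% figure follows directly from the bs=1, isl=4096 output-throughput entries in the two TP8 tables above:

```python
# Relative decode-throughput gap between V3.1 and V3.2, TP8, bs=1, isl=4096,
# using the output-throughput numbers from the benchmark tables above.
v31_tp8_tps = 128.60  # DeepSeek V3.1, TP8, bs=1 output throughput (tok/s)
v32_tp8_tps = 80.62   # DeepSeek V3.2, TP8, bs=1 output throughput (tok/s)

gap = (v31_tp8_tps - v32_tp8_tps) / v31_tp8_tps
print(f"V3.2 is {gap:.1%} slower than V3.1 at bs=1")  # -> 37.3%
```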