
[Roadmap] DeepSeek v3.2 Optimization #15025

@Fridge003


Background

There has been an increasing need to deploy DeepSeek V3.2 and DeepSeek V3.2 Speciale since their release. However, there are still functionality and performance gaps between DeepSeek V3.2 and DeepSeek V3.1.

Optimization Items

Parallelism

Kernel/Algorithm Optimization (Prior roadmap #11989)

MTP

Others

Related resources

Profiling command example

export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3
# bs1
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai \
  --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 \
  --max-concurrency 1 --num-prompts 5 --profile
# bs32
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai \
  --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 \
  --max-concurrency 32 --num-prompts 32 --profile
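The resulting torch profiler traces are written to the directory set in SGLANG_TORCH_PROFILER_DIR (typically one Chrome-trace JSON file per rank) and can be opened in https://ui.perfetto.dev or chrome://tracing. The snippet below is only a convenience for locating them; exact file naming may vary by SGLang version.

# List the dumped traces for inspection
ls -lh $SGLANG_TORCH_PROFILER_DIR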

Long context performance

Here is the performance comparison between DeepSeek V3.1 and V3.2 on long context lengths. (Collected by @XucSh)

[Figures: DeepSeek V3.1 vs V3.2 long-context performance comparison]

From the figure, we can see that DeepSeek V3.2 shows an advantage at long context lengths such as 32k, and pipeline parallelism (PP) helps significantly with latency.
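To reproduce a long-context data point with PP enabled, a launch/benchmark pair along the following lines can be used. This is an illustrative sketch: the --pp-size/--nnodes values, the placeholder <HEAD_NODE_IP>, and the 32k/128 token lengths are assumptions, not the exact configuration behind the figure.

# Launch with pipeline parallelism across 2 nodes (adjust flags to your cluster)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code \
  --tp-size 8 --pp-size 2 --nnodes 2 --node-rank 0 --dist-init-addr <HEAD_NODE_IP>:5000
# 32k-input benchmark point
python3 -m sglang.bench_serving --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --backend sglang-oai \
  --random-range-ratio 1 --random-input-len 32768 --random-output-len 128 \
  --max-concurrency 1 --num-prompts 5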

Benchmark data (Updated on 12/11)

Here is the nightly performance data collected on 01/08 (8*B200, link):

deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 10.41 | 7003.29 | 52.13 | n/a | 19.18 | 0.11 | 10.66 | trace | trace |
| 8 | 4096 | 11.20 | 32024.01 | 402.37 | n/a | 19.88 | 0.02 | 1.38 | trace | trace |
| 16 | 4096 | 14.49 | 42795.90 | 632.33 | n/a | 25.30 | 0.02 | 0.88 | trace | trace |
| 64 | 4096 | 19.70 | 43666.88 | 2392.64 | n/a | 26.75 | 0.02 | 0.23 | trace | trace |

deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (DP8+MTP)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 5.98 | 10752.06 | 91.40 | n/a | 10.94 | 0.07 | 6.08 | trace | trace |
| 8 | 4096 | 7.44 | 40049.51 | 618.72 | 2.67 | 12.93 | 0.02 | 0.90 | trace | trace |
| 16 | 4096 | 7.49 | 41491.37 | 1386.14 | 3.09 | 11.54 | 0.02 | 0.40 | trace | trace |

deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 7.23 | 4634.57 | 80.62 | n/a | 12.40 | 0.17 | 6.89 | trace | trace |
| 8 | 4096 | 10.20 | 15694.80 | 504.66 | n/a | 15.85 | 0.05 | 1.10 | trace | trace |
| 16 | 4096 | 13.44 | 15967.91 | 877.79 | n/a | 18.23 | 0.05 | 0.63 | trace | trace |
| 64 | 4096 | 32.10 | 16129.85 | 2067.71 | n/a | 30.95 | 0.05 | 0.27 | trace | trace |

deepseek-ai/DeepSeek-V3.2 [8-gpu-b200] (TP8+MTP)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4.36 | 9093.46 | 130.92 | 3.07 | 7.64 | 0.09 | 4.24 | trace | trace |
| 8 | 4096 | 7.83 | 15420.75 | 717.38 | 3.0 | 11.15 | 0.05 | 0.77 | trace | trace |
| 16 | 4096 | 10.23 | 15590.51 | 1359.11 | 3.27 | 11.77 | 0.05 | 0.41 | trace | trace |

deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4.16 | 22590.03 | 128.60 | n/a | 7.78 | 0.04 | 4.32 | trace | trace |
| 8 | 4096 | 6.58 | 32987.73 | 733.29 | n/a | 10.91 | 0.02 | 0.76 | trace | trace |
| 16 | 4096 | 8.92 | 40294.88 | 1123.31 | n/a | 14.24 | 0.02 | 0.49 | trace | trace |
| 64 | 4096 | 18.83 | 41034.27 | 2632.78 | n/a | 24.31 | 0.02 | 0.21 | trace | trace |

deepseek-ai/DeepSeek-V3.1 [8-gpu-b200] (TP8+MTP)

| batch size | input len | latency (s) | input throughput (tok/s) | output throughput (tok/s) | acc length | ITL (ms) | input cost ($/1M) | output cost ($/1M) | profile (extend) | profile (decode) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4096 | 2.27 | 26147.32 | 241.75 | 3.05 | 4.14 | 0.03 | 2.30 | trace | trace |
| 8 | 4096 | 5.78 | 37656.27 | 834.55 | 2.64 | 9.59 | 0.02 | 0.67 | trace | trace |
| 16 | 4096 | 8.00 | 38651.86 | 1299.46 | 2.56 | 12.31 | 0.02 | 0.43 | trace | trace |

These results show substantial room for performance improvement in the DeepSeek V3.2 model family. In a common low-latency case (bs=1, isl=4k), V3.2 with TP8 (80 tok/s) is ~37% slower than V3.1 with TP8 (128 tok/s).
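For reference, the +MTP rows above correspond to enabling DeepSeek's multi-token-prediction head through SGLang's EAGLE-style speculative decoding flags. A representative launch looks roughly like the following; the speculative step/top-k/draft-token values are assumptions for illustration and are not necessarily the settings used for these runs (check the SGLang speculative decoding docs for recommended values).

# Hypothetical MTP-enabled launch (TP8); speculative flag values are illustrative
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code \
  --tp-size 8 \
  --speculative-algorithm EAGLE --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4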

