
[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU #12000

Merged
Fridge003 merged 14 commits into sgl-project:main from zminglei:dpsk-deterministic
Oct 24, 2025

Conversation

@zminglei (Collaborator) commented Oct 23, 2025

Motivation

Part of this Issue: #10278

As part of deepseek deterministic inference support, this change ensures deterministic inference results for deepseek arch models on a single GPU.

Modifications

  1. Fixed AttnForwardMethod to MLA instead of determining it at runtime based on batch status.
  2. Replaced torch.bmm with batch_invariant_bmm when deterministic inference is enabled (see the sketch below).

Currently only the fa3 and triton attention backends are supported; flashinfer backend support will follow in a later PR.
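A minimal sketch of what the two modifications amount to. The flag plumbing, the real AttnForwardMethod enum, and sglang's tuned batch-invariant kernel are not shown; `batch_invariant_bmm` below is a naive stand-in used only to illustrate the invariance property, not the actual implementation:

```python
import torch

def batch_invariant_bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Naive stand-in: compute each matmul independently, so the reduction for a
    # given matrix pair is identical no matter how requests are batched together.
    # The real kernel is a tuned batch-invariant implementation, not this loop.
    return torch.stack([torch.mm(a[i], b[i]) for i in range(a.shape[0])])

def choose_attn_forward_method(deterministic: bool, runtime_choice: str) -> str:
    # Modification 1: when deterministic inference is enabled, always use MLA
    # rather than picking the method at runtime based on batch status.
    return "MLA" if deterministic else runtime_choice

def bmm(a: torch.Tensor, b: torch.Tensor, deterministic: bool) -> torch.Tensor:
    # Modification 2: swap torch.bmm for the batch-invariant version only when
    # deterministic inference is enabled.
    return batch_invariant_bmm(a, b) if deterministic else torch.bmm(a, b)
```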

Accuracy Tests

  1. Test with the sglang-ci-dsv3-test (deepseek_v3 fp8) model on an H100

FA3 Backend

Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.660
Invalid: 0.000
Latency: 11.559 s
Output throughput: 1515.855 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 298, Unique samples: 2
Prompt 1 with prefix length 511: total samples: 345, Unique samples: 8
Prompt 2 with prefix length 2048: total samples: 324, Unique samples: 4
Prompt 3 with prefix length 4097: total samples: 308, Unique samples: 8

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.655
Invalid: 0.000
Latency: 16.573 s
Output throughput: 1067.717 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 355, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 291, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 292, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 337, Unique samples: 1

# Server log shows the radix cache is being used (#cached-token > 0 on the later prefill batch)
[2025-10-23 09:38:07] Prefill batch. #new-seq: 1, #new-token: 776, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-23 09:38:07] Prefill batch. #new-seq: 23, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 11,
[2025-10-23 09:38:07] Prefill batch. #new-seq: 12, #new-token: 489, #cached-token: 3045, token usage: 0.01, #running-req: 23, #queue-req: 0,
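For context on what the prefix test above measures: the same prompts are sampled many times under varying batch compositions, and the number of distinct completions is counted (1 unique sample means fully deterministic). A stripped-down sketch of that idea against sglang's OpenAI-compatible completions endpoint is below; it is not the sglang.test.test_deterministic implementation, it samples sequentially rather than mixing prefix-sharing prompts into differently composed batches, and the port and model name are assumed local defaults:

```python
import requests

def count_unique_completions(prompt: str, n: int = 50,
                             url: str = "http://localhost:30000/v1/completions") -> int:
    """Sample `prompt` n times at temperature 0 and count distinct outputs."""
    outputs = set()
    for _ in range(n):
        resp = requests.post(url, json={
            "model": "default",   # assumed name for a locally launched server
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0,
        })
        outputs.add(resp.json()["choices"][0]["text"])
    return len(outputs)  # 1 means the server was deterministic for this prompt

if __name__ == "__main__":
    print(count_unique_completions("Singapore is a"))
```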

Triton Backend

Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.655
Invalid: 0.000
Latency: 14.259 s
Output throughput: 1310.498 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 308, Unique samples: 2
Prompt 1 with prefix length 511: total samples: 294, Unique samples: 3
Prompt 2 with prefix length 2048: total samples: 314, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 359, Unique samples: 5

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.630
Invalid: 0.000
Latency: 34.322 s
Output throughput: 517.445 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 300, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 299, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 332, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 344, Unique samples: 1

# Server log shows the radix cache is being used (#cached-token > 0 on the later prefill batch)
[2025-10-23 09:45:53] Prefill batch. #new-seq: 1, #new-token: 776, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-23 09:45:54] Prefill batch. #new-seq: 21, #new-token: 8002, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 23,
[2025-10-23 09:45:54] Prefill batch. #new-seq: 23, #new-token: 23, #cached-token: 5952, token usage: 0.01, #running-req: 22, #queue-req: 0,

  2. Test with the DeepSeek-Coder-V2-Lite-Instruct (deepseek_v2 arch) model on an H100

FA3 Backend
Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.830
Invalid: 0.000
Latency: 10.394 s
Output throughput: 2498.193 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 317, Unique samples: 3
Prompt 1 with prefix length 511: total samples: 307, Unique samples: 10
Prompt 2 with prefix length 2048: total samples: 320, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 331, Unique samples: 17

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.830
Invalid: 0.000
Latency: 14.150 s
Output throughput: 1844.425 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 355, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 291, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 292, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 337, Unique samples: 1

Triton Backend
Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.805
Invalid: 0.000
Latency: 11.297 s
Output throughput: 2276.343 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 323, Unique samples: 3
Prompt 1 with prefix length 511: total samples: 337, Unique samples: 9
Prompt 2 with prefix length 2048: total samples: 296, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 319, Unique samples: 13

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.800
Invalid: 0.000
Latency: 16.252 s
Output throughput: 1609.897 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 300, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 299, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 332, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 344, Unique samples: 1

Benchmarking and Profiling

Checklist

@zminglei changed the title from "support deterministic for deepseek model" to "support deterministic inference for deepseek model" on Oct 23, 2025
@zminglei marked this pull request as ready for review on October 23, 2025 05:31
@hebiao064 (Collaborator)

Can you add a server log to demonstrate that the radix cache is being used?

@zminglei (Collaborator, Author)

> Can you add a server log to demonstrate that the radix cache is being used?

Sure, I added server logs for sglang-ci-dsv3-test to demonstrate that the radix cache is being used when deterministic inference is enabled.

@hebiao064 (Collaborator)

I made a minor change to the det unit test in #12022; it allows subclasses to easily override the test model.

After that PR is merged and you have validated that dsv3 works with #12000, please add dsv3-test to the unit test.
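A rough sketch of the subclass-override pattern being described; class names, the default model, and the sampling helper are illustrative only, not the actual code in #12022:

```python
import unittest

class TestDeterministicBase(unittest.TestCase):
    # Subclasses override only this attribute to point the suite at another model.
    MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical default test model

    def sample(self, prompt: str) -> str:
        # Placeholder: a real test launches a server for self.MODEL and samples
        # from it; returning a constant keeps this sketch self-contained.
        return f"{self.MODEL}: fixed output"

    def test_repeated_sampling_is_deterministic(self):
        outputs = {self.sample("2 + 2 =") for _ in range(5)}
        self.assertEqual(len(outputs), 1)

class TestDeterministicDeepSeekV3(TestDeterministicBase):
    # The override suggested above: reuse the base tests with the dsv3 CI model.
    MODEL = "sglang-ci-dsv3-test"

if __name__ == "__main__":
    unittest.main()
```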

@Fridge003 (Collaborator)

@zminglei Can we handle the default attention backend for dpsk here?
https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L1517
Since flashinfer is currently not supported for dpsk but will be set by default on Blackwell.
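A small sketch of the kind of guard being suggested; the real logic lives in python/sglang/srt/server_args.py, and the backend and architecture names here are assumptions for illustration rather than the merged change:

```python
def resolve_attention_backend(model_arch: str, deterministic: bool,
                              default_backend: str) -> str:
    # If deterministic inference is requested for a DeepSeek-arch model and the
    # platform default would be flashinfer (e.g. on Blackwell), fall back to a
    # backend this PR supports (fa3 or triton).
    deepseek_archs = {"DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"}
    if deterministic and model_arch in deepseek_archs and default_backend == "flashinfer":
        return "fa3"
    return default_backend
```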

@zminglei (Collaborator, Author)

> @zminglei Can we handle the default attention backend for dpsk here? https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L1517 Since flashinfer is currently not supported for dpsk but will be set by default on Blackwell.

Updated it, please take another look.

@hebiao064 changed the title from "support deterministic inference for deepseek model" to "support deterministic inference for deepseek model on hopper" on Oct 24, 2025
@zminglei changed the title from "support deterministic inference for deepseek model on hopper" to "[1/N] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" on Oct 24, 2025
@zminglei changed the title from "[1/N] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" to "[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" on Oct 24, 2025
@Fridge003 merged commit f4b78d1 into sgl-project:main on Oct 24, 2025
100 of 107 checks passed