
[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU #12000

Merged
Fridge003 merged 14 commits into sgl-project:main from zminglei:dpsk-deterministic
Oct 24, 2025

Conversation

@zminglei (Collaborator) commented Oct 23, 2025

Motivation

Part of this Issue: #10278

As part of deepseek deterministic inference support, this change ensures deterministic inference results for deepseek arch models on a single GPU.

Modifications

  1. Fixed AttnForwardMethod to MLA instead of determining it at runtime based on batch status.
  2. Replaced torch.bmm with batch_invariant_bmm when deterministic inference is enabled (see the sketch below).

Currently only the fa3 and triton attention backends are supported; flashinfer backend support will follow in a later PR.
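A minimal sketch of what the two modifications amount to. The flag plumbing, the real AttnForwardMethod enum, and sglang's tuned batch-invariant kernel are not shown; `batch_invariant_bmm` below is a naive stand-in used only to illustrate the invariance property, not the actual implementation:

```python
import torch

def batch_invariant_bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Naive stand-in: compute each matmul independently, so the reduction for a
    # given matrix pair is identical no matter how requests are batched together.
    # The real kernel is a tuned batch-invariant implementation, not this loop.
    return torch.stack([torch.mm(a[i], b[i]) for i in range(a.shape[0])])

def choose_attn_forward_method(deterministic: bool, runtime_choice: str) -> str:
    # Modification 1: when deterministic inference is enabled, always use MLA
    # rather than picking the method at runtime based on batch status.
    return "MLA" if deterministic else runtime_choice

def bmm(a: torch.Tensor, b: torch.Tensor, deterministic: bool) -> torch.Tensor:
    # Modification 2: swap torch.bmm for the batch-invariant version only when
    # deterministic inference is enabled.
    return batch_invariant_bmm(a, b) if deterministic else torch.bmm(a, b)
```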

Accuracy Tests

  1. Test with the sglang-ci-dsv3-test (deepseek_v3 fp8) model on an H100

FA3 Backend

Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.660
Invalid: 0.000
Latency: 11.559 s
Output throughput: 1515.855 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 298, Unique samples: 2
Prompt 1 with prefix length 511: total samples: 345, Unique samples: 8
Prompt 2 with prefix length 2048: total samples: 324, Unique samples: 4
Prompt 3 with prefix length 4097: total samples: 308, Unique samples: 8

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.655
Invalid: 0.000
Latency: 16.573 s
Output throughput: 1067.717 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 355, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 291, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 292, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 337, Unique samples: 1

# Server log shows the radix cache is being used (#cached-token > 0 on the later prefill batch)
[2025-10-23 09:38:07] Prefill batch. #new-seq: 1, #new-token: 776, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-23 09:38:07] Prefill batch. #new-seq: 23, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 11,
[2025-10-23 09:38:07] Prefill batch. #new-seq: 12, #new-token: 489, #cached-token: 3045, token usage: 0.01, #running-req: 23, #queue-req: 0,
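For context on what the prefix test above measures: the same prompts are sampled many times under varying batch compositions, and the number of distinct completions is counted (1 unique sample means fully deterministic). A stripped-down sketch of that idea against sglang's OpenAI-compatible completions endpoint is below; it is not the sglang.test.test_deterministic implementation, it samples sequentially rather than mixing prefix-sharing prompts into differently composed batches, and the port and model name are assumed local defaults:

```python
import requests

def count_unique_completions(prompt: str, n: int = 50,
                             url: str = "http://localhost:30000/v1/completions") -> int:
    """Sample `prompt` n times at temperature 0 and count distinct outputs."""
    outputs = set()
    for _ in range(n):
        resp = requests.post(url, json={
            "model": "default",   # assumed name for a locally launched server
            "prompt": prompt,
            "max_tokens": 64,
            "temperature": 0,
        })
        outputs.add(resp.json()["choices"][0]["text"])
    return len(outputs)  # 1 means the server was deterministic for this prompt

if __name__ == "__main__":
    print(count_unique_completions("Singapore is a"))
```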

Triton Backend

Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.655
Invalid: 0.000
Latency: 14.259 s
Output throughput: 1310.498 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 308, Unique samples: 2
Prompt 1 with prefix length 511: total samples: 294, Unique samples: 3
Prompt 2 with prefix length 2048: total samples: 314, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 359, Unique samples: 5

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.630
Invalid: 0.000
Latency: 34.322 s
Output throughput: 517.445 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 300, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 299, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 332, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 344, Unique samples: 1

# Server log shows the radix cache is being used (#cached-token > 0 on the later prefill batch)
[2025-10-23 09:45:53] Prefill batch. #new-seq: 1, #new-token: 776, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-23 09:45:54] Prefill batch. #new-seq: 21, #new-token: 8002, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 23,
[2025-10-23 09:45:54] Prefill batch. #new-seq: 23, #new-token: 23, #cached-token: 5952, token usage: 0.01, #running-req: 22, #queue-req: 0,

  2. Test with the DeepSeek-Coder-V2-Lite-Instruct (deepseek_v2 arch) model on an H100

FA3 Backend
Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.830
Invalid: 0.000
Latency: 10.394 s
Output throughput: 2498.193 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 317, Unique samples: 3
Prompt 1 with prefix length 511: total samples: 307, Unique samples: 10
Prompt 2 with prefix length 2048: total samples: 320, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 331, Unique samples: 17

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
200/200 [00:10<00:00, 19.36it/s]
Accuracy: 0.830
Invalid: 0.000
Latency: 14.150 s
Output throughput: 1844.425 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 355, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 291, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 292, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 337, Unique samples: 1

Triton Backend
Without deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.805
Invalid: 0.000
Latency: 11.297 s
Output throughput: 2276.343 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 323, Unique samples: 3
Prompt 1 with prefix length 511: total samples: 337, Unique samples: 9
Prompt 2 with prefix length 2048: total samples: 296, Unique samples: 5
Prompt 3 with prefix length 4097: total samples: 319, Unique samples: 13

Enable deterministic

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.800
Invalid: 0.000
Latency: 16.252 s
Output throughput: 1609.897 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix
Prompt 0 with prefix length 1: total samples: 300, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 299, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 332, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 344, Unique samples: 1

Benchmarking and Profiling

Checklist

@zminglei changed the title from "support deterministic for deepseek model" to "support deterministic inference for deepseek model" on Oct 23, 2025
@zminglei marked this pull request as ready for review on October 23, 2025 05:31
@hebiao064 (Collaborator)

Can you add a server log to demonstrate that the radix cache is being used?

@zminglei (Collaborator, Author)

> Can you add a server log to demonstrate that the radix cache is being used?

Sure, I added server logs for sglang-ci-dsv3-test to demonstrate that the radix cache is being used when deterministic inference is enabled.

@hebiao064 (Collaborator)

I made a minor change to the det unit test in #12022; it allows subclasses to easily override the test model.

After that PR is merged and you have validated that dsv3 works with #12000, please add dsv3-test to the unit test.
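A rough sketch of the subclass-override pattern being described; class names, the default model, and the sampling helper are illustrative only, not the actual code in #12022:

```python
import unittest

class TestDeterministicBase(unittest.TestCase):
    # Subclasses override only this attribute to point the suite at another model.
    MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical default test model

    def sample(self, prompt: str) -> str:
        # Placeholder: a real test launches a server for self.MODEL and samples
        # from it; returning a constant keeps this sketch self-contained.
        return f"{self.MODEL}: fixed output"

    def test_repeated_sampling_is_deterministic(self):
        outputs = {self.sample("2 + 2 =") for _ in range(5)}
        self.assertEqual(len(outputs), 1)

class TestDeterministicDeepSeekV3(TestDeterministicBase):
    # The override suggested above: reuse the base tests with the dsv3 CI model.
    MODEL = "sglang-ci-dsv3-test"

if __name__ == "__main__":
    unittest.main()
```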

@Fridge003 (Collaborator)

@zminglei Can we handle the default attention backend for dpsk here?
https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L1517
Since flashinfer is currently not supported for dpsk but will be set by default on Blackwell.
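A small sketch of the kind of guard being suggested; the real logic lives in python/sglang/srt/server_args.py, and the backend and architecture names here are assumptions for illustration rather than the merged change:

```python
def resolve_attention_backend(model_arch: str, deterministic: bool,
                              default_backend: str) -> str:
    # If deterministic inference is requested for a DeepSeek-arch model and the
    # platform default would be flashinfer (e.g. on Blackwell), fall back to a
    # backend this PR supports (fa3 or triton).
    deepseek_archs = {"DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"}
    if deterministic and model_arch in deepseek_archs and default_backend == "flashinfer":
        return "fa3"
    return default_backend
```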

@zminglei (Collaborator, Author)

> @zminglei Can we handle the default attention backend for dpsk here? https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py#L1517 Since flashinfer is currently not supported for dpsk but will be set by default on Blackwell.

Updated it, please take another look.

@hebiao064 changed the title from "support deterministic inference for deepseek model" to "support deterministic inference for deepseek model on hopper" on Oct 24, 2025
@zminglei changed the title from "support deterministic inference for deepseek model on hopper" to "[1/N] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" on Oct 24, 2025
@zminglei changed the title from "[1/N] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" to "[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU" on Oct 24, 2025
@Fridge003 merged commit f4b78d1 into sgl-project:main on Oct 24, 2025
100 of 107 checks passed