Add DeepSeek-V4-Pro FP8 MI355X SGLang benchmark#1134
Conversation
Based on sgl-project/sglang#23608 (AMD DSv4 support). Uses the custom Docker image lmsysorg/sglang:deepseek-v4-mi350 with the compressed attention backend, DP attention, and a TP=4/DP=4 configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
--dp $TP \
--enable-dp-attention \
--trust-remote-code \
--disable-radix-cache \
--attention-backend compressed \
--max-running-request 256 \
```
🟡 Two SGLang flags in the new script use non-canonical abbreviated forms: --dp $TP on line 57 (should be --dp-size) and --max-running-request 256 on line 62 (should be --max-running-requests, plural). Every other SGLang benchmark script in this repo uses the canonical --dp-size/--data-parallel-size and --max-running-requests (e.g. dsr1_fp8_b200.sh, glm5_fp8_b200.sh, qwen3.5_fp8_b200.sh, multi_node/amd_utils/server.sh). These work today only because argparse's default allow_abbrev=True resolves them as unambiguous prefix matches — they would silently break on a future SGLang release that adds another --dp-* flag (e.g. --dp-lb-port) or on a build that sets allow_abbrev=False, either via an "ambiguous option" error at server startup or by not applying the intended 256 running-request cap / DP=TP setting.
Extended reasoning
The bug
At benchmarks/single_node/dsv4_fp8_mi355x.sh:57 and :62, two SGLang launch_server flags are written in non-canonical abbreviated form:
```
--tensor-parallel-size $TP \
--dp $TP \                    # line 57 — should be --dp-size $TP
...
--max-running-request 256 \   # line 62 — should be --max-running-requests 256
```

SGLang's ServerArgs registers exactly one option for each: `--dp-size` (aliased as `--data-parallel-size`) and `--max-running-requests` (plural). Neither `--dp` nor `--max-running-request` is a registered alias. Every other benchmark script in this repo uses the canonical names — multi_node/amd_utils/server.sh uses `--dp-size`, 25+ single-node scripts use `--data-parallel-size`, and 30+ scripts use `--max-running-requests` (plural). Within this very file, line 56 already uses the full `--tensor-parallel-size` form, so `--dp` on line 57 is inconsistent even locally.
Why it happens to work today
Python's argparse defaults to allow_abbrev=True. As long as SGLang registers only one option starting with each prefix, argparse resolves --dp → --dp-size and --max-running-request → --max-running-requests as unambiguous prefix matches, and the values are applied. So on the current custom image lmsysorg/sglang:deepseek-v4-mi350 this most likely runs successfully.
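For illustration, a minimal standalone argparse sketch of the prefix-matching behavior; this is not SGLang's actual parser, just two stand-in options mirroring the flags above:

```python
import argparse

# allow_abbrev=True is argparse's default: unambiguous prefixes resolve.
parser = argparse.ArgumentParser()
parser.add_argument("--dp-size", type=int)
parser.add_argument("--max-running-requests", type=int)

args = parser.parse_args(["--dp", "4", "--max-running-request", "256"])
print(args.dp_size, args.max_running_requests)  # prints: 4 256 (abbreviations resolved)

# Register a sibling flag sharing the --dp prefix and the abbreviation dies:
parser.add_argument("--dp-lb-port", type=int)
parser.parse_args(["--dp", "4"])
# SystemExit: error: ambiguous option: --dp could match --dp-size, --dp-lb-port
```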
Why it is still worth fixing
The other verifier correctly pointed out that the breakage scenarios are hypothetical. That objection is fair as stated — argparse would raise an ambiguous-option error rather than silently pick the wrong flag, and today's behavior is correct. But the fix is a one-line rename for each flag, and the script sits on top of a custom fork build from sgl-project/sglang#23608, not a stable release. Two concrete fragilities remain:
- Upstream SGLang could add a sibling flag at any time. If a future version introduces any `--dp-*` option (e.g. `--dp-lb-port`, `--dp-balance-mode`) or a `--max-running-request-*` option, argparse will raise `ambiguous option: --dp could match --dp-size, --dp-lb-port` at server startup, and the benchmark will silently fail to launch. Since this script's image is a PR-branch build that will be rebased onto upstream, this is a real risk window, not purely hypothetical.
- A build with `allow_abbrev=False` would reject both flags. The linked PR #23608 could set this on its parser; if it ever does, the server fails with `unrecognized arguments: --dp --max-running-request 256`.
Step-by-step proof the canonical forms are required
- Grep upstream python/sglang/srt/server_args.py: only `--dp-size`/`--data-parallel-size` and `--max-running-requests` are registered. No `--dp` alias. No `--max-running-request` alias.
- Grep this repo for `--dp` (bare): the only hit is dsv4_fp8_mi355x.sh:57. All 25+ other scripts use `--dp-size` or `--data-parallel-size`.
- Grep this repo for `--max-running-request` (singular): the only hit is dsv4_fp8_mi355x.sh:62. All 30+ other scripts use `--max-running-requests` (plural).
- Concrete failure example: if upstream SGLang merges a `--dp-lb-port` flag next release and this image is rebuilt, running this script produces `sglang.launch_server: error: ambiguous option: --dp could match --dp-size, --dp-lb-port` and the benchmark aborts before `wait_for_server_ready` completes.
Fix
Two trivial renames on lines 57 and 62:
```
- --dp $TP \
+ --dp-size $TP \
  ...
- --max-running-request 256 \
+ --max-running-requests 256 \
```

This brings the script in line with every other benchmark in benchmarks/ and removes both fragilities.
The container image lacks native deepseek_v4 model type registration. Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4) to resolve the KeyError at config loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's huggingface_hub rejects rope_theta=10000 (int) when it expects a float. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DeepseekV4Config (from transformers#45616) declares rope_theta as float, but the HF config.json has rope_theta: 10000 (int). huggingface_hub strict dataclass validation rejects the mismatch. Sed-patch the cached config.json after download as a workaround. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Stop upgrading huggingface_hub (breaks the hf CLI due to a typer incompatibility in the container)
- Replace sed with a Python JSON round-trip for the rope_theta patch to avoid corrupting config.json (sed matched inside larger numbers)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
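A minimal sketch of that JSON round-trip, assuming it targets the cached config.json (the path below is a placeholder; the real one comes from the HF cache lookup in the next commit):

```python
import json
from pathlib import Path

# Placeholder path; in the script this would come from the HF cache lookup below.
config_path = Path("/mnt/hf_hub_cache/models--.../snapshots/.../config.json")

cfg = json.loads(config_path.read_text())
# Coerce only the offending field. Unlike the earlier sed expression, a JSON
# round-trip cannot match inside unrelated numbers elsewhere in the file.
if isinstance(cfg.get("rope_theta"), int):
    cfg["rope_theta"] = float(cfg["rope_theta"])
config_path.write_text(json.dumps(cfg, indent=2))
```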
The container mounts the HF cache at /mnt/hf_hub_cache/, not the default ~/.cache/huggingface. Use scan_cache_dir() which respects HF_HOME/HF_HUB_CACHE env vars to find the actual snapshot path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
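A sketch of the scan_cache_dir() lookup; the repo id is this PR's model, and treating the most recently modified revision as the snapshot in use is an assumption:

```python
from huggingface_hub import scan_cache_dir

# scan_cache_dir() honors HF_HOME / HF_HUB_CACHE, so it finds the cache under
# /mnt/hf_hub_cache/ without hard-coding the mount point.
cache = scan_cache_dir()
repo = next(r for r in cache.repos
            if r.repo_id == "sgl-project/DeepSeek-V4-Flash-FP8")
# Assumption: the most recently modified revision is the one in use.
rev = max(repo.revisions, key=lambda v: v.last_modified)
print(rev.snapshot_path)  # directory holding config.json
```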
Instead of patching config.json (which corrupted JSON and depended on cache path), patch the installed configuration_deepseek_v4.py to widen rope_theta type from float to Union[int, float]. This survives any cache location and avoids touching model files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --no-deps to the pip install to prevent upgrading huggingface_hub (which breaks the hf CLI due to a typer incompatibility)
- Move the rope_theta source patch before the hf download so the patched config class is in place when AutoConfig loads the model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The rope_theta type patch replaced `float` with `Union[int, float]` but the Union import wasn't reliably added. Prepend import unconditionally. Also removed --no-deps from transformers install to resolve is_offline_mode ImportError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs used sed to patch config.json which corrupted numbers like 100000 → 100000.0. The corrupted file persists in /mnt/hf_hub_cache and snapshot_download skips re-downloading it. Validate and remove any corrupt copies so they get freshly downloaded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
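A sketch of that validate-and-evict pass. The rglob walk and the unlink-the-blob step are assumptions about the cache layout, and json.loads failing stands in for "corrupt":

```python
import json
from pathlib import Path

hub_cache = Path("/mnt/hf_hub_cache")  # the mount named in the earlier commit

# Evict any config.json that no longer parses, so snapshot_download re-fetches
# it instead of reusing the sed-corrupted copy. Note this only catches files
# that are syntactically broken, not valid JSON with wrong values.
for cfg in hub_cache.rglob("config.json"):
    try:
        json.loads(cfg.read_text())
    except (ValueError, OSError):
        blob = cfg.resolve()  # snapshot entries are symlinks into blobs/
        cfg.unlink()
        if blob != cfg:
            blob.unlink(missing_ok=True)
```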
Fields hitting strict int/float validation come in sequence: first rope_theta, now compress_rope_theta (value 160000). Rather than playing whack-a-mole, rewrite every `: float =` annotation in the config module to `: Union[int, float] =` so the huggingface_hub @strict validator accepts ints for any of them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
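A sketch of the blanket rewrite, assuming the module lives at the path the transformers PR uses and that every annotation is spelled exactly `: float =`:

```python
import re
from pathlib import Path

# Resolve the installed file via import instead of hard-coding site-packages.
# Assumption: the transformers PR ships the module under this name.
import transformers.models.deepseek_v4.configuration_deepseek_v4 as cfg_mod

path = Path(cfg_mod.__file__)
src = path.read_text()
# Prepend the import unconditionally, as the earlier commit settled on;
# a duplicate import is harmless in Python.
src = "from typing import Union\n" + src
# Widen every float-annotated field so the @strict validator accepts ints.
path.write_text(re.sub(r": float =", ": Union[int, float] =", src))
```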
Switch image from lmsysorg/sglang:deepseek-v4-mi350 to lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk, which is the image named in sgl-project/sglang#23608 and contains the native DSv4 model class under python/sglang/srt/models/deepseek_v4.py. The previous image lacked the native class, causing SGLang to fall back to its generic transformers adapter. That path required a patched transformers branch and hit OOM because it materialized the full MoE on every rank. With the correct image, none of that patching is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The image has PR #23608's SGLang code with a get_config fallback for deepseek_v4, but the fallback hits 'else: raise e' rather than the deepseek_v4 branch (line 300 in the stack). Rather than debug that, pre-empt by editing the cached config.json to set model_type=deepseek_v3. transformers.AutoConfig then accepts it, and SGLang still dispatches by architecture (DeepseekV4ForCausalLM) to its native DSv4 model class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
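A sketch of that pre-emptive model_type edit, with the same placeholder-path caveat as the rope_theta patch above:

```python
import json
from pathlib import Path

# Placeholder path; resolved via the scan_cache_dir() lookup in practice.
config_path = Path("/mnt/hf_hub_cache/models--.../snapshots/.../config.json")

cfg = json.loads(config_path.read_text())
# AutoConfig dispatches on model_type, which deepseek_v3 satisfies; SGLang
# dispatches on architectures, so DeepseekV4ForCausalLM still selects the
# native DSv4 model class.
cfg["model_type"] = "deepseek_v3"
config_path.write_text(json.dumps(cfg, indent=2))
```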
SGLang's internal warmup request was being killed by the scheduler watchdog at 300s because the first forward pass JIT-compiles the fused MoE Triton kernels (no precompiled config for E=256,N=512,dtype=fp8_w8a8 ships in the image). Give it 30 min to finish compiling before the watchdog fires. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kkHuang-amd pushed rocm/sgl-dev:deepseek-v4-mi35x today in response to issues flagged on PR #23608: deepseek_v4 not registered in transformers, and a circular import in flashmla_tests.kernelkit that crashed on first forward pass when SGLANG_HACK_FLASHMLA_BACKEND=torch (our exact config). Keeping the config.json model_type patch as a defensive no-op in case transformers still needs it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Renamed dsv4_fp8_mi355x.sh -> dsv4pro_fp8_mi355x.sh, changed config key dsv4-fp8-mi355x-sglang -> dsv4pro-fp8-mi355x-sglang, model prefix dsv4 -> dsv4pro, model sgl-project/DeepSeek-V4-Flash-FP8 -> sgl-project/DeepSeek-V4-Pro-FP8. Script content is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed d34eb60 to 8c32d73 (Compare)
Match the env vars in dsv4_fp{8,4}_mi355x.sh to the amd/deepseek_v4
branch's python/run_dsv4.sh at SHA 18afbf15:
- Add SGLANG_REASONING_EFFORT=max to both scripts (it was missing).
- Set SGLANG_FORCE_TRITON_MOE_FP8=0 in the FP8 script. The FP4 models
integration commit (sgl-project/sglang@33de1e64, Apr 28) made this the
gating switch for aiter MoE dispatch on both paths. Our FP8 script
predates the flip (#1134) and was still pinning it to 1, which forced the
Triton MoE fallback.
Switch dsv4-fp8-mi355x-sglang to deepseek-ai/DeepSeek-V4-Pro to match the
canonical checkpoint loaded by every other DSv4 entry and by run_dsv4.sh.
Drop the duplicate dsv4-fp8-mi355x-sglang perf-changelog stub left in
the worktree (#1160 entry already exists earlier in the file); replace
with a real entry covering this PR's image bump, env-var alignment,
model swap, and new FP4 variant.
Summary
- `dsv4-fp8-mi355x-sglang` benchmark config for DeepSeek-V4-Flash FP8 on AMD MI355X with SGLang
- `benchmarks/single_node/dsv4_fp8_mi355x.sh` with DSv4-specific env vars and server args from sgl-project/sglang#23608
- Image: `lmsysorg/sglang:deepseek-v4-mi350` (custom build from PR branch), Model: `sgl-project/DeepSeek-V4-Flash-FP8`

Test plan

- `generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml --model-prefix dsv4` generates 10 entries
- `pytest utils/matrix_logic/ -v` — 149/149 tests pass

🤖 Generated with Claude Code