
Add DeepSeek-V4-Pro FP8 MI355X SGLang benchmark #1134

Merged
Oseltamivir merged 27 commits into main from dsv4-fp8-mi355x-sglang
Apr 24, 2026

Conversation

@Oseltamivir
Collaborator

Summary

  • Add dsv4-fp8-mi355x-sglang benchmark config for DeepSeek-V4-Flash FP8 on AMD MI355X with SGLang
  • New benchmark script benchmarks/single_node/dsv4_fp8_mi355x.sh with DSv4-specific env vars and server args from sgl-project/sglang#23608
  • Image: lmsysorg/sglang:deepseek-v4-mi350 (custom build from PR branch), Model: sgl-project/DeepSeek-V4-Flash-FP8
  • Compressed attention backend, DP attention enabled, TP=4/DP=4, CUDA graph disabled
  • Search space: TP=4, concurrency 4-64 for 1k1k and 8k1k

Test plan

  • generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml --model-prefix dsv4 generates 10 entries
  • pytest utils/matrix_logic/ -v — 149/149 tests pass
  • Verify benchmark runs on MI355X hardware

🤖 Generated with Claude Code

Based on sgl-project/sglang#23608 (AMD DSv4 support). Uses custom
docker image lmsysorg/sglang:deepseek-v4-mi350 with compressed
attention backend, DP attention, and TP=4/DP=4 configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +57 to +62
--dp $TP \
--enable-dp-attention \
--trust-remote-code \
--disable-radix-cache \
--attention-backend compressed \
--max-running-request 256 \
Contributor

🟡 Two SGLang flags in the new script use non-canonical abbreviated forms: --dp $TP on line 57 (should be --dp-size) and --max-running-request 256 on line 62 (should be --max-running-requests, plural). Every other SGLang benchmark script in this repo uses the canonical --dp-size/--data-parallel-size and --max-running-requests (e.g. dsr1_fp8_b200.sh, glm5_fp8_b200.sh, qwen3.5_fp8_b200.sh, multi_node/amd_utils/server.sh). These work today only because argparse's default allow_abbrev=True resolves them as unambiguous prefix matches — they would silently break on a future SGLang release that adds another --dp-* flag (e.g. --dp-lb-port) or on a build that sets allow_abbrev=False, either via an "ambiguous option" error at server startup or by not applying the intended 256 running-request cap / DP=TP setting.

Extended reasoning...

The bug

At benchmarks/single_node/dsv4_fp8_mi355x.sh:57 and :62, two SGLang launch_server flags are written in non-canonical abbreviated form:

--tensor-parallel-size $TP \
--dp $TP \                    # line 57 — should be --dp-size $TP
...
--max-running-request 256 \    # line 62 — should be --max-running-requests 256

SGLang's ServerArgs registers exactly one option for each: --dp-size (aliased as --data-parallel-size) and --max-running-requests (plural). Neither --dp nor --max-running-request is a registered alias. Every other benchmark script in this repo uses the canonical names — multi_node/amd_utils/server.sh uses --dp-size, 25+ single-node scripts use --data-parallel-size, and 30+ scripts use --max-running-requests (plural). Within this very file, line 56 already uses the full --tensor-parallel-size form, so --dp on line 57 is inconsistent even locally.

Why it happens to work today

Python's argparse defaults to allow_abbrev=True. As long as SGLang registers only one option starting with each prefix, argparse resolves --dp → --dp-size and --max-running-request → --max-running-requests as unambiguous prefix matches, and the values are applied. So on the current custom image lmsysorg/sglang:deepseek-v4-mi350 this most likely runs successfully.
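The prefix-matching behavior can be reproduced with a toy parser (a minimal sketch, not SGLang's actual ServerArgs; --dp-lb-port stands in for the hypothetical future flag):

```python
import argparse

# Toy parser registering only the two options in question.
parser = argparse.ArgumentParser(allow_abbrev=True)  # argparse default
parser.add_argument("--dp-size", type=int)
parser.add_argument("--max-running-requests", type=int)

# Today, the abbreviated forms resolve as unambiguous prefix matches.
args = parser.parse_args(["--dp", "4", "--max-running-request", "256"])
print(args.dp_size, args.max_running_requests)  # 4 256

# Adding any sibling --dp-* option makes --dp ambiguous: argparse calls
# parser.error(), which prints the message and raises SystemExit(2).
parser.add_argument("--dp-lb-port", type=int)  # hypothetical future flag
try:
    parser.parse_args(["--dp", "4"])
except SystemExit:
    print("startup fails with: ambiguous option: --dp")
```

The same parse would also fail if the parser were constructed with allow_abbrev=False, in which case --dp is simply an unrecognized argument.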

Why it is still worth fixing

The other verifier correctly pointed out that the breakage scenarios are hypothetical. That objection is fair as stated — argparse would raise ambiguous option rather than silently pick the wrong flag, and today's behavior is correct. But the script is a new, one-line-each fix and sits on top of a custom fork build from sgl-project/sglang#23608, not a stable release. Two concrete fragilities remain:

  1. Upstream SGLang could add a sibling flag at any time. If a future version introduces any --dp-* option (e.g. --dp-lb-port, --dp-balance-mode) or --max-running-request-* option, argparse will raise ambiguous option: --dp could match --dp-size, --dp-lb-port at server startup, and the benchmark will silently fail to launch. Since this script's image is a PR-branch build that will be rebased onto upstream, this is a real risk window, not purely hypothetical.
  2. A build with allow_abbrev=False would reject both flags. The linked PR #23608 could set this on its parser; if it ever does, the server fails with unrecognized arguments: --dp --max-running-request 256.

Step-by-step proof the canonical forms are required

  1. Grep upstream python/sglang/srt/server_args.py: only --dp-size / --data-parallel-size and --max-running-requests are registered. No --dp alias. No --max-running-request alias.
  2. Grep this repo for --dp (bare): only hit is dsv4_fp8_mi355x.sh:57. All 25+ other scripts use --dp-size or --data-parallel-size.
  3. Grep this repo for --max-running-request (singular): only hit is dsv4_fp8_mi355x.sh:62. All 30+ other scripts use --max-running-requests (plural).
  4. Concrete failure example: if upstream SGLang merges a --dp-lb-port flag next release and this image is rebuilt, running this script produces:
    sglang.launch_server: error: ambiguous option: --dp could match --dp-size, --dp-lb-port
    
    and the benchmark aborts before wait_for_server_ready completes.

Fix

Two trivial renames on lines 57 and 62:

-    --dp $TP \
+    --dp-size $TP \
...
-    --max-running-request 256 \
+    --max-running-requests 256 \

This brings the script in line with every other benchmark in benchmarks/ and removes both fragilities.

Oseltamivir and others added 17 commits April 24, 2026 00:25
The container image lacks native deepseek_v4 model type registration.
Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4)
to resolve the KeyError at config loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's huggingface_hub rejects rope_theta=10000 (int) when
it expects a float.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DeepseekV4Config (from transformers#45616) declares rope_theta as
float, but the HF config.json has rope_theta: 10000 (int).
huggingface_hub strict dataclass validation rejects the mismatch.
Sed-patch the cached config.json after download as a workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Stop upgrading huggingface_hub (breaks hf CLI due to typer
  incompatibility in the container)
- Replace sed with python json round-trip for rope_theta patch
  to avoid corrupting config.json (sed matched inside larger numbers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
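The json round-trip this commit describes can be sketched as follows (an illustrative reconstruction, not the script's actual code; the path and key handling are assumptions). Unlike a sed substitution, it touches only the rope_theta key, so other numbers that happen to contain "10000" are left intact:

```python
import json

def patch_rope_theta(path="config.json"):
    # Load, coerce only rope_theta from int to float, and write back.
    with open(path) as f:
        cfg = json.load(f)
    if isinstance(cfg.get("rope_theta"), int):
        cfg["rope_theta"] = float(cfg["rope_theta"])
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```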
The container mounts the HF cache at /mnt/hf_hub_cache/, not the
default ~/.cache/huggingface. Use scan_cache_dir() which respects
HF_HOME/HF_HUB_CACHE env vars to find the actual snapshot path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
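The cache-path resolution this commit relies on can be sketched in plain stdlib Python (this mirrors, but is not, huggingface_hub's own logic behind scan_cache_dir(): HF_HUB_CACHE wins, then HF_HOME/hub, then the default):

```python
import os

def hf_hub_cache_dir(env=os.environ):
    # Resolution order assumed from the commit message: explicit
    # HF_HUB_CACHE, then HF_HOME-relative hub dir, then the default.
    if env.get("HF_HUB_CACHE"):
        return env["HF_HUB_CACHE"]
    if env.get("HF_HOME"):
        return os.path.join(env["HF_HOME"], "hub")
    return os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "hub")
```

With the container's mount, hf_hub_cache_dir({"HF_HUB_CACHE": "/mnt/hf_hub_cache"}) resolves to /mnt/hf_hub_cache rather than the ~/.cache/huggingface default.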
Instead of patching config.json (which corrupted JSON and depended on
cache path), patch the installed configuration_deepseek_v4.py to widen
rope_theta type from float to Union[int, float]. This survives any
cache location and avoids touching model files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --no-deps to pip install to prevent upgrading huggingface_hub
  (which breaks hf CLI due to typer incompatibility)
- Move rope_theta source patch before hf download so the patched
  config class is in place when AutoConfig loads the model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The rope_theta type patch replaced `float` with `Union[int, float]`
but the Union import wasn't reliably added. Prepend import
unconditionally. Also removed --no-deps from transformers install
to resolve is_offline_mode ImportError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs used sed to patch config.json which corrupted numbers
like 100000 → 100000.0. The corrupted file persists in /mnt/hf_hub_cache
and snapshot_download skips re-downloading it. Validate and remove
any corrupt copies so they get freshly downloaded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
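The validation step can be sketched as below (a hedged reconstruction: the actual script's checks may differ, and the integer field names are assumptions). The sed corruption pattern turns known-integer fields into floats (100000 → 100000.0), so a float where an int is expected marks the cached copy for deletion:

```python
import json, os

INT_FIELDS = ("vocab_size", "hidden_size", "num_hidden_layers")

def remove_if_corrupt(path):
    # Delete the cached config.json if any known-int field was turned
    # into a float, so snapshot_download re-fetches a clean copy.
    try:
        with open(path) as f:
            cfg = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    if any(isinstance(cfg.get(k), float) for k in INT_FIELDS):
        os.remove(path)
        return True
    return False
```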
Fields hitting strict int/float validation come in sequence: first
rope_theta, now compress_rope_theta (value 160000). Rather than
playing whack-a-mole, rewrite every `: float =` annotation in the
config module to `: Union[int, float] =` so the huggingface_hub
@strict validator accepts ints for any of them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
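The blanket annotation rewrite can be sketched as a text transform over the module source (illustrative only; the real patch edits the installed configuration_deepseek_v4.py file in place and, per the earlier commit, prepends the Union import unconditionally):

```python
import re

def widen_float_annotations(source: str) -> str:
    # Rewrite every `: float =` field annotation so the strict validator
    # accepts ints, and prepend the typing import unconditionally.
    patched = re.sub(r": float =", ": Union[int, float] =", source)
    return "from typing import Union\n" + patched
```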
Switch image from lmsysorg/sglang:deepseek-v4-mi350 to
lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk, which is the image
named in sgl-project/sglang#23608 and contains the native DSv4
model class under python/sglang/srt/models/deepseek_v4.py.

The previous image lacked the native class, causing SGLang to
fall back to its generic transformers adapter. That path
required a patched transformers branch and hit OOM because it
materialized the full MoE on every rank. With the correct image,
none of that patching is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The image has PR #23608's SGLang code with a get_config fallback
for deepseek_v4, but the fallback hits 'else: raise e' rather than
the deepseek_v4 branch (line 300 in the stack). Rather than debug
that, pre-empt by editing the cached config.json to set
model_type=deepseek_v3. transformers.AutoConfig then accepts it,
and SGLang still dispatches by architecture (DeepseekV4ForCausalLM)
to its native DSv4 model class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
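The model_type workaround can be sketched as follows (a hypothetical reconstruction; path and key handling are assumptions). It rewrites only model_type in the cached config.json, leaving the architectures list untouched so SGLang still dispatches on DeepseekV4ForCausalLM:

```python
import json

def patch_model_type(path, old="deepseek_v4", new="deepseek_v3"):
    # Swap model_type so transformers.AutoConfig accepts the config;
    # the architectures entry is deliberately left as-is.
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") == old:
        cfg["model_type"] = new
        with open(path, "w") as f:
            json.dump(cfg, f, indent=2)
```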
SGLang's internal warmup request was being killed by the scheduler
watchdog at 300s because the first forward pass JIT-compiles the
fused MoE Triton kernels (no precompiled config for
E=256,N=512,dtype=fp8_w8a8 ships in the image). Give it 30 min
to finish compiling before the watchdog fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kkHuang-amd pushed rocm/sgl-dev:deepseek-v4-mi35x today in response
to issues flagged on PR #23608: deepseek_v4 not registered in
transformers, and a circular import in flashmla_tests.kernelkit
that crashed on first forward pass when SGLANG_HACK_FLASHMLA_BACKEND=torch
(our exact config). Keeping the config.json model_type patch as a
defensive no-op in case transformers still needs it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Renamed dsv4_fp8_mi355x.sh -> dsv4pro_fp8_mi355x.sh, changed config
key dsv4-fp8-mi355x-sglang -> dsv4pro-fp8-mi355x-sglang, model prefix
dsv4 -> dsv4pro, model sgl-project/DeepSeek-V4-Flash-FP8 ->
sgl-project/DeepSeek-V4-Pro-FP8. Script content is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir changed the title from "Add DeepSeek-V4-Flash FP8 MI355X SGLang benchmark" to "Add DeepSeek-V4-Pro FP8 MI355X SGLang benchmark" on Apr 24, 2026
@Oseltamivir Oseltamivir merged commit 32a1de4 into main Apr 24, 2026
2 checks passed
@Oseltamivir Oseltamivir deleted the dsv4-fp8-mi355x-sglang branch April 24, 2026 23:51
@Oseltamivir Oseltamivir mentioned this pull request Apr 29, 2026
Oseltamivir added a commit that referenced this pull request Apr 29, 2026
Match the env vars in dsv4_fp{8,4}_mi355x.sh to the amd/deepseek_v4
branch's python/run_dsv4.sh at SHA 18afbf15:

- Add SGLANG_REASONING_EFFORT=max to both scripts (was missing).
- Set SGLANG_FORCE_TRITON_MOE_FP8=0 in the FP8 script. The FP4 Models
  integration commit (sgl-project/sglang@33de1e64, Apr 28) made this the
  gating switch for aiter MoE dispatch on both paths. Our FP8 script
  predates the flip (#1134) and was still pinning 1, which forced the
  triton MoE fallback.

Switch dsv4-fp8-mi355x-sglang to deepseek-ai/DeepSeek-V4-Pro to match the
canonical checkpoint loaded by every other DSv4 entry and by run_dsv4.sh.

Drop the duplicate dsv4-fp8-mi355x-sglang perf-changelog stub left in
the worktree (#1160 entry already exists earlier in the file); replace
with a real entry covering this PR's image bump, env-var alignment,
model swap, and new FP4 variant.