Add DeepSeek-V4-Pro FP8 MI355X SGLang benchmark#1134
Conversation
Based on sgl-project/sglang#23608 (AMD DSv4 support). Uses the custom Docker image lmsysorg/sglang:deepseek-v4-mi350 with the compressed attention backend, DP attention, and a TP=4/DP=4 configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
--dp $TP \
--enable-dp-attention \
--trust-remote-code \
--disable-radix-cache \
--attention-backend compressed \
--max-running-request 256 \
```
🟡 Two SGLang flags in the new script use non-canonical abbreviated forms: --dp $TP on line 57 (should be --dp-size) and --max-running-request 256 on line 62 (should be --max-running-requests, plural). Every other SGLang benchmark script in this repo uses the canonical --dp-size/--data-parallel-size and --max-running-requests (e.g. dsr1_fp8_b200.sh, glm5_fp8_b200.sh, qwen3.5_fp8_b200.sh, multi_node/amd_utils/server.sh). These work today only because argparse's default allow_abbrev=True resolves them as unambiguous prefix matches — they would silently break on a future SGLang release that adds another --dp-* flag (e.g. --dp-lb-port) or on a build that sets allow_abbrev=False, either via an "ambiguous option" error at server startup or by not applying the intended 256 running-request cap / DP=TP setting.
Extended reasoning
The bug
At benchmarks/single_node/dsv4_fp8_mi355x.sh:57 and :62, two SGLang launch_server flags are written in non-canonical abbreviated form:
```
--tensor-parallel-size $TP \
--dp $TP \                    # line 57 — should be --dp-size $TP
...
--max-running-request 256 \   # line 62 — should be --max-running-requests 256
```

SGLang's ServerArgs registers exactly one option for each: `--dp-size` (aliased as `--data-parallel-size`) and `--max-running-requests` (plural). Neither `--dp` nor `--max-running-request` is a registered alias. Every other benchmark script in this repo uses the canonical names — multi_node/amd_utils/server.sh uses `--dp-size`, 25+ single-node scripts use `--data-parallel-size`, and 30+ scripts use `--max-running-requests` (plural). Within this very file, line 56 already uses the full `--tensor-parallel-size` form, so `--dp` on line 57 is inconsistent even locally.
Why it happens to work today
Python's argparse defaults to allow_abbrev=True. As long as SGLang registers only one option starting with each prefix, argparse resolves --dp → --dp-size and --max-running-request → --max-running-requests as unambiguous prefix matches, and the values are applied. So on the current custom image lmsysorg/sglang:deepseek-v4-mi350 this most likely runs successfully.
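For illustration, a minimal standalone argparse sketch of the prefix-matching behavior; this is not SGLang's actual parser, just two stand-in options mirroring the flags above:

```python
import argparse

# allow_abbrev=True is argparse's default: unambiguous prefixes resolve.
parser = argparse.ArgumentParser()
parser.add_argument("--dp-size", type=int)
parser.add_argument("--max-running-requests", type=int)

args = parser.parse_args(["--dp", "4", "--max-running-request", "256"])
print(args.dp_size, args.max_running_requests)  # prints: 4 256 (abbreviations resolved)

# Register a sibling flag sharing the --dp prefix and the abbreviation dies:
parser.add_argument("--dp-lb-port", type=int)
parser.parse_args(["--dp", "4"])
# SystemExit: error: ambiguous option: --dp could match --dp-size, --dp-lb-port
```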
Why it is still worth fixing
The other verifier correctly pointed out that the breakage scenarios are hypothetical. That objection is fair as stated — argparse would raise an ambiguous-option error rather than silently pick the wrong flag, and today's behavior is correct. But the fix is a one-line rename for each flag, and the script sits on top of a custom fork build from sgl-project/sglang#23608, not a stable release. Two concrete fragilities remain:
- Upstream SGLang could add a sibling flag at any time. If a future version introduces any `--dp-*` option (e.g. `--dp-lb-port`, `--dp-balance-mode`) or a `--max-running-request-*` option, argparse will raise `ambiguous option: --dp could match --dp-size, --dp-lb-port` at server startup, and the benchmark will silently fail to launch. Since this script's image is a PR-branch build that will be rebased onto upstream, this is a real risk window, not purely hypothetical.
- A build with `allow_abbrev=False` would reject both flags. The linked PR #23608 could set this on its parser; if it ever does, the server fails with `unrecognized arguments: --dp --max-running-request 256`.
Step-by-step proof the canonical forms are required
- Grep upstream python/sglang/srt/server_args.py: only `--dp-size`/`--data-parallel-size` and `--max-running-requests` are registered. No `--dp` alias. No `--max-running-request` alias.
- Grep this repo for `--dp` (bare): the only hit is dsv4_fp8_mi355x.sh:57. All 25+ other scripts use `--dp-size` or `--data-parallel-size`.
- Grep this repo for `--max-running-request` (singular): the only hit is dsv4_fp8_mi355x.sh:62. All 30+ other scripts use `--max-running-requests` (plural).
- Concrete failure example: if upstream SGLang merges a `--dp-lb-port` flag next release and this image is rebuilt, running this script produces `sglang.launch_server: error: ambiguous option: --dp could match --dp-size, --dp-lb-port` and the benchmark aborts before `wait_for_server_ready` completes.
Fix
Two trivial renames on lines 57 and 62:
```
- --dp $TP \
+ --dp-size $TP \
  ...
- --max-running-request 256 \
+ --max-running-requests 256 \
```

This brings the script in line with every other benchmark in benchmarks/ and removes both fragilities.
The container image lacks native deepseek_v4 model type registration. Install from huggingface/transformers#45616 (ArthurZucker/add-deepseek-v4) to resolve the KeyError at config loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The container's huggingface_hub rejects rope_theta=10000 (int) when it expects a float. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DeepseekV4Config (from transformers#45616) declares rope_theta as float, but the HF config.json has rope_theta: 10000 (int). huggingface_hub strict dataclass validation rejects the mismatch. Sed-patch the cached config.json after download as a workaround. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Stop upgrading huggingface_hub (breaks the hf CLI due to a typer incompatibility in the container)
- Replace sed with a Python JSON round-trip for the rope_theta patch to avoid corrupting config.json (sed matched inside larger numbers)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
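A minimal sketch of that JSON round-trip, assuming it targets the cached config.json (the path below is a placeholder; the real one comes from the HF cache lookup in the next commit):

```python
import json
from pathlib import Path

# Placeholder path; in the script this would come from the HF cache lookup below.
config_path = Path("/mnt/hf_hub_cache/models--.../snapshots/.../config.json")

cfg = json.loads(config_path.read_text())
# Coerce only the offending field. Unlike the earlier sed expression, a JSON
# round-trip cannot match inside unrelated numbers elsewhere in the file.
if isinstance(cfg.get("rope_theta"), int):
    cfg["rope_theta"] = float(cfg["rope_theta"])
config_path.write_text(json.dumps(cfg, indent=2))
```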
The container mounts the HF cache at /mnt/hf_hub_cache/, not the default ~/.cache/huggingface. Use scan_cache_dir() which respects HF_HOME/HF_HUB_CACHE env vars to find the actual snapshot path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
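A sketch of the scan_cache_dir() lookup; the repo id is this PR's model, and treating the most recently modified revision as the snapshot in use is an assumption:

```python
from huggingface_hub import scan_cache_dir

# scan_cache_dir() honors HF_HOME / HF_HUB_CACHE, so it finds the cache under
# /mnt/hf_hub_cache/ without hard-coding the mount point.
cache = scan_cache_dir()
repo = next(r for r in cache.repos
            if r.repo_id == "sgl-project/DeepSeek-V4-Flash-FP8")
# Assumption: the most recently modified revision is the one in use.
rev = max(repo.revisions, key=lambda v: v.last_modified)
print(rev.snapshot_path)  # directory holding config.json
```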
Instead of patching config.json (which corrupted JSON and depended on cache path), patch the installed configuration_deepseek_v4.py to widen rope_theta type from float to Union[int, float]. This survives any cache location and avoids touching model files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --no-deps to the pip install to prevent upgrading huggingface_hub (which breaks the hf CLI due to a typer incompatibility)
- Move the rope_theta source patch before the hf download so the patched config class is in place when AutoConfig loads the model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The rope_theta type patch replaced `float` with `Union[int, float]` but the Union import wasn't reliably added. Prepend import unconditionally. Also removed --no-deps from transformers install to resolve is_offline_mode ImportError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs used sed to patch config.json which corrupted numbers like 100000 → 100000.0. The corrupted file persists in /mnt/hf_hub_cache and snapshot_download skips re-downloading it. Validate and remove any corrupt copies so they get freshly downloaded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
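A sketch of that validate-and-evict pass. The rglob walk and the unlink-the-blob step are assumptions about the cache layout, and json.loads failing stands in for "corrupt":

```python
import json
from pathlib import Path

hub_cache = Path("/mnt/hf_hub_cache")  # the mount named in the earlier commit

# Evict any config.json that no longer parses, so snapshot_download re-fetches
# it instead of reusing the sed-corrupted copy. Note this only catches files
# that are syntactically broken, not valid JSON with wrong values.
for cfg in hub_cache.rglob("config.json"):
    try:
        json.loads(cfg.read_text())
    except (ValueError, OSError):
        blob = cfg.resolve()  # snapshot entries are symlinks into blobs/
        cfg.unlink()
        if blob != cfg:
            blob.unlink(missing_ok=True)
```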
Fields hitting strict int/float validation come in sequence: first rope_theta, now compress_rope_theta (value 160000). Rather than playing whack-a-mole, rewrite every `: float =` annotation in the config module to `: Union[int, float] =` so the huggingface_hub @strict validator accepts ints for any of them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
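A sketch of the blanket rewrite, assuming the module lives at the path the transformers PR uses and that every annotation is spelled exactly `: float =`:

```python
import re
from pathlib import Path

# Resolve the installed file via import instead of hard-coding site-packages.
# Assumption: the transformers PR ships the module under this name.
import transformers.models.deepseek_v4.configuration_deepseek_v4 as cfg_mod

path = Path(cfg_mod.__file__)
src = path.read_text()
# Prepend the import unconditionally, as the earlier commit settled on;
# a duplicate import is harmless in Python.
src = "from typing import Union\n" + src
# Widen every float-annotated field so the @strict validator accepts ints.
path.write_text(re.sub(r": float =", ": Union[int, float] =", src))
```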
Switch image from lmsysorg/sglang:deepseek-v4-mi350 to lmsysorg/sglang:v0.5.8-rocm700-mi35x-dsk, which is the image named in sgl-project/sglang#23608 and contains the native DSv4 model class under python/sglang/srt/models/deepseek_v4.py. The previous image lacked the native class, causing SGLang to fall back to its generic transformers adapter. That path required a patched transformers branch and hit OOM because it materialized the full MoE on every rank. With the correct image, none of that patching is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The image has PR #23608's SGLang code with a get_config fallback for deepseek_v4, but the fallback hits 'else: raise e' rather than the deepseek_v4 branch (line 300 in the stack). Rather than debug that, pre-empt by editing the cached config.json to set model_type=deepseek_v3. transformers.AutoConfig then accepts it, and SGLang still dispatches by architecture (DeepseekV4ForCausalLM) to its native DSv4 model class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
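A sketch of that pre-emptive model_type edit, with the same placeholder-path caveat as the rope_theta patch above:

```python
import json
from pathlib import Path

# Placeholder path; resolved via the scan_cache_dir() lookup in practice.
config_path = Path("/mnt/hf_hub_cache/models--.../snapshots/.../config.json")

cfg = json.loads(config_path.read_text())
# AutoConfig dispatches on model_type, which deepseek_v3 satisfies; SGLang
# dispatches on architectures, so DeepseekV4ForCausalLM still selects the
# native DSv4 model class.
cfg["model_type"] = "deepseek_v3"
config_path.write_text(json.dumps(cfg, indent=2))
```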
SGLang's internal warmup request was being killed by the scheduler watchdog at 300s because the first forward pass JIT-compiles the fused MoE Triton kernels (no precompiled config for E=256,N=512,dtype=fp8_w8a8 ships in the image). Give it 30 min to finish compiling before the watchdog fires. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kkHuang-amd pushed rocm/sgl-dev:deepseek-v4-mi35x today in response to issues flagged on PR #23608: deepseek_v4 not registered in transformers, and a circular import in flashmla_tests.kernelkit that crashed on first forward pass when SGLANG_HACK_FLASHMLA_BACKEND=torch (our exact config). Keeping the config.json model_type patch as a defensive no-op in case transformers still needs it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Renamed dsv4_fp8_mi355x.sh -> dsv4pro_fp8_mi355x.sh, changed config key dsv4-fp8-mi355x-sglang -> dsv4pro-fp8-mi355x-sglang, model prefix dsv4 -> dsv4pro, model sgl-project/DeepSeek-V4-Flash-FP8 -> sgl-project/DeepSeek-V4-Pro-FP8. Script content is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed d34eb60 to 8c32d73 (Compare)
Match the env vars in dsv4_fp{8,4}_mi355x.sh to the amd/deepseek_v4
branch's python/run_dsv4.sh at SHA 18afbf15:
- Add SGLANG_REASONING_EFFORT=max to both scripts (it was missing).
- Set SGLANG_FORCE_TRITON_MOE_FP8=0 in the FP8 script. The FP4 models
integration commit (sgl-project/sglang@33de1e64, Apr 28) made this the
gating switch for aiter MoE dispatch on both paths. Our FP8 script
predates the flip (#1134) and was still pinning it to 1, which forced the
Triton MoE fallback.
Switch dsv4-fp8-mi355x-sglang to deepseek-ai/DeepSeek-V4-Pro to match the
canonical checkpoint loaded by every other DSv4 entry and by run_dsv4.sh.
Drop the duplicate dsv4-fp8-mi355x-sglang perf-changelog stub left in
the worktree (#1160 entry already exists earlier in the file); replace
with a real entry covering this PR's image bump, env-var alignment,
model swap, and new FP4 variant.
Summary
- `dsv4-fp8-mi355x-sglang` benchmark config for DeepSeek-V4-Flash FP8 on AMD MI355X with SGLang
- `benchmarks/single_node/dsv4_fp8_mi355x.sh` with DSv4-specific env vars and server args from sgl-project/sglang#23608
- Image: `lmsysorg/sglang:deepseek-v4-mi350` (custom build from PR branch), Model: `sgl-project/DeepSeek-V4-Flash-FP8`

Test plan

- `generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml --model-prefix dsv4` generates 10 entries
- `pytest utils/matrix_logic/ -v` — 149/149 tests pass

🤖 Generated with Claude Code