SGL DSV4 MI355x by Oseltamivir · Pull Request #1231 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-04-29T21:23:52Z

No description provided.

github-actions · 2026-04-29T21:24:00Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-04-29T21:31:16Z

+(
+    cd "$SGL_PR_DIR"
+    git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
+        || git fetch --depth=1 origin amd/deepseek_v4
+    git checkout --force "$SGL_PR_SHA"
+    test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"
+
+    # Reinstall just the Python package; the image already has the ROCm
+    # kernel deps (aiter, triton, tilelang, torch) at versions matched to
+    # this branch, so --no-deps avoids pip resolving them against PyPI.
+    pip install --no-build-isolation --no-deps --force-reinstall -e python/
+)


🔴 The new sglang overlay block (lines 20-41) does not actually enforce the 'reproducible pin per benchmark run' the comment promises: the fallback git fetch --depth=1 origin amd/deepseek_v4 only retrieves the branch tip, so when SGL_PR_SHA lags the branch (the exact case the 'Bump SGL_PR_SHA when the branch advances' comment anticipates) and the by-SHA fetch fails, the subsequent git checkout --force "$SGL_PR_SHA" cannot resolve the SHA. Combined with the missing set -eo pipefail (the sister script dsv4_fp8_mi355x_vllm.sh:2 has it), the failed checkout, the test $(git rev-parse HEAD) = $SGL_PR_SHA pin guard at line 35, and any other failure are all swallowed — pip install -e python/ runs unconditionally against whatever was previously checked out, and benchmark numbers get reported against the wrong sglang. Fix: add set -eo pipefail near the top, and either drop the --depth=1 fallback or fetch amd/deepseek_v4 with deeper history (e.g. --depth=50) so older pinned SHAs remain reachable.

Extended reasoning...

Bug

The overlay block added at lines 20-41 promises a "reproducible pin per benchmark run" (see comment lines 20-23), but the control flow does not enforce it. Two compounding issues silently install the wrong sglang on a fetch/pin mismatch:

1. Shallow-fetch fallback cannot resolve a non-tip pin (lines 32-33). The clone at line 26 uses --filter=blob:none and clones only the default branch, which is not amd/deepseek_v4. The subshell then runs:

git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \ || git fetch --depth=1 origin amd/deepseek_v4

The fallback retrieves a single commit — the current tip of amd/deepseek_v4. The very next line, git checkout --force "$SGL_PR_SHA", then fails with reference is not a tree whenever SGL_PR_SHA is not that tip. The script's own comment "Bump SGL_PR_SHA when the branch advances" explicitly anticipates that the pinned SHA lags the branch tip, so this is not a hypothetical case. The fetch-by-SHA can fail for ordinary reasons (transient GitHub error, unauthenticated rate limit, the SHA being GC'd after a force-push), and the fallback then cannot produce the requested object.

2. No set -eo pipefail (line 1 is the shebang, line 3 sources benchmark_lib.sh with no shell-options line in between). The sister script benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh:2 sets set -eo pipefail, and the sourced benchmark_lib.sh does not enable errexit either. Without it, every failure inside the subshell on lines 30-41 is swallowed:

the failed git checkout --force "$SGL_PR_SHA" does not abort the subshell;

test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" at line 35 — the explicit pin guard — only sets $?, it does not exit;

pip install --no-build-isolation --no-deps --force-reinstall -e python/ at line 40 then runs unconditionally, against whatever the working tree happens to contain, and its exit code becomes the subshell's, masking every earlier failure;

the parent script also lacks errexit, so even a non-zero subshell exit would be ignored.

The python3 -c "import sglang; print(...)" line on 43 only prints the version, it does not assert anything, and sglang.launch_server then runs against the wrong sglang.

Step-by-step proof

Assume SGL_PR_SHA = 18afbf15… and the branch amd/deepseek_v4 has since advanced to a different SHA Y. The fetch-by-SHA call fails (transient 503 from github.com):

Line 25-27: clone succeeds, default branch only (no amd/deepseek_v4, no 18afbf15…).

Line 32: git fetch --depth=1 origin 18afbf15… → fails (stderr suppressed via 2>/dev/null), exit 128.

Line 33: git fetch --depth=1 origin amd/deepseek_v4 → succeeds, downloads only commit Y.

Line 34: git checkout --force 18afbf15… → fails: fatal: reference is not a tree: 18afbf15…. Without set -e, subshell continues; HEAD is still whatever the initial clone left (default branch tip).

Line 35: test "$(git rev-parse HEAD)" = "18afbf15…" → $? set non-zero. Subshell continues.

Line 40: pip install … -e python/ → runs against the default-branch tree (not the pin), succeeds, exit 0.

Subshell exits 0 (pip's status). Parent continues.

Line 43: import sglang; print(...) → prints the wrong version.

sglang.launch_server runs with the wrong sglang and benchmark numbers are reported under the assumed pin.

A second realistic trigger: the by-SHA fetch succeeds on first run, then on a later invocation in the same /tmp/sglang-amd-dsv4 checkout the network is flaky. The if [ ! -d "$SGL_PR_DIR/.git" ] guard at line 25 means the clone is reused; the fallback then leaves the local repo with only commit Y; the same silent cascade follows.

Fix

Minimum: add set -eo pipefail near the top of the script (matching the vllm sister script). That alone causes the failed checkout / mismatched-HEAD test to abort the subshell and the script.

Additionally, make the fallback actually able to produce non-tip SHAs — drop --depth=1 from the branch fallback, or use a deeper fetch (e.g. --depth=50) so older pinned SHAs remain reachable.

Switch the FP8 entry to rocm/sgl-dev:rocm720-deepseek-v4-mi35x (Apr 28 ROCm 7.2.0 image), aligned with the amd/deepseek_v4 SHA pinned by the overlay in dsv4_fp8_mi355x.sh. Add dsv4_fp4_mi355x_sglang.sh + dsv4-fp4-mi355x-sglang config entry, mirroring the branch's run_dsv4.sh FP4 path: SGLANG_DSV4_FP4_EXPERTS=True, SGLANG_FORCE_TRITON_MOE_FP8=0, model deepseek-ai/DeepSeek-V4-Pro.

Match the env vars in dsv4_fp{8,4}_mi355x.sh to the amd/deepseek_v4 branch's python/run_dsv4.sh at SHA 18afbf15: - Add SGLANG_REASONING_EFFORT=max to both scripts (was missing). - Set SGLANG_FORCE_TRITON_MOE_FP8=0 in the FP8 script. The FP4 Models integration commit (sgl-project/sglang@33de1e64, Apr 28) made this the gating switch for aiter MoE dispatch on both paths. Our FP8 script predates the flip (#1134) and was still pinning 1, which forced the triton MoE fallback. Switch dsv4-fp8-mi355x-sglang to deepseek-ai/DeepSeek-V4-Pro to match the canonical checkpoint loaded by every other DSv4 entry and by run_dsv4.sh. Drop the duplicate dsv4-fp8-mi355x-sglang perf-changelog stub left in the worktree (#1160 entry already exists earlier in the file); replace with a real entry covering this PR's image bump, env-var alignment, model swap, and new FP4 variant.

rocm720-deepseek-v4-mi35x crashes on global_tokens.fill_(0) inside _dp_gather_via_all_reduce during forward_idle: torch.AcceleratorError: HIP error: invalid configuration argument File ".../layers/dp_attention.py", line 454, in _dp_gather_via_all_reduce global_tokens.fill_(0) global_tokens comes from get_global_dp_buffer(), which allocates inside use_symmetric_memory(). The SGLANG_USE_ROCM700A workaround only swaps the cuda-graph default padding mode to SUM_LEN; with --disable-cuda-graph we take the eager path and still allocate via symmetric memory, hitting the RCCL symmetric-memory bug the workaround was created for. Switch both dsv4-fp{8,4}-mi355x-sglang entries to v0.5.10.post1-rocm700-mi35x-20260428 (Apr 28 ROCm 7.0.0a tag) to keep the sglang overlay perf changes shippable. Revisit rocm720 once the WA is extended to cover eager mode or RCCL symm-mem lands.

SGL DSV4 MI355x

1c6d5b1

Oseltamivir requested a review from a team April 29, 2026 21:23

github-project-automation Bot added this to InferenceMAX Board Apr 29, 2026

claude Bot reviewed Apr 29, 2026

View reviewed changes

Oseltamivir requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners April 29, 2026 21:38

Oseltamivir added the sweep-enabled label Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SGL DSV4 MI355x#1231

SGL DSV4 MI355x#1231
Oseltamivir wants to merge 4 commits intomainfrom
sgl-dsv4-mi355x

Oseltamivir commented Apr 29, 2026

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

claude Bot Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Oseltamivir commented Apr 29, 2026

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

claude Bot Apr 29, 2026

Choose a reason for hiding this comment

Bug

Step-by-step proof

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant