Skip to content

SGL DSV4 MI355x#1231

Open
Oseltamivir wants to merge 4 commits intomainfrom
sgl-dsv4-mi355x
Open

SGL DSV4 MI355x#1231
Oseltamivir wants to merge 4 commits intomainfrom
sgl-dsv4-mi355x

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +30 to +41
(
cd "$SGL_PR_DIR"
git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
|| git fetch --depth=1 origin amd/deepseek_v4
git checkout --force "$SGL_PR_SHA"
test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"

# Reinstall just the Python package; the image already has the ROCm
# kernel deps (aiter, triton, tilelang, torch) at versions matched to
# this branch, so --no-deps avoids pip resolving them against PyPI.
pip install --no-build-isolation --no-deps --force-reinstall -e python/
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new sglang overlay block (lines 20-41) does not actually enforce the 'reproducible pin per benchmark run' the comment promises: the fallback git fetch --depth=1 origin amd/deepseek_v4 only retrieves the branch tip, so when SGL_PR_SHA lags the branch (the exact case the 'Bump SGL_PR_SHA when the branch advances' comment anticipates) and the by-SHA fetch fails, the subsequent git checkout --force "$SGL_PR_SHA" cannot resolve the SHA. Combined with the missing set -eo pipefail (the sister script dsv4_fp8_mi355x_vllm.sh:2 has it), the failed checkout, the test $(git rev-parse HEAD) = $SGL_PR_SHA pin guard at line 35, and any other failure are all swallowed — pip install -e python/ runs unconditionally against whatever was previously checked out, and benchmark numbers get reported against the wrong sglang. Fix: add set -eo pipefail near the top, and either drop the --depth=1 fallback or fetch amd/deepseek_v4 with deeper history (e.g. --depth=50) so older pinned SHAs remain reachable.

Extended reasoning...

Bug

The overlay block added at lines 20-41 promises a "reproducible pin per benchmark run" (see comment lines 20-23), but the control flow does not enforce it. Two compounding issues silently install the wrong sglang on a fetch/pin mismatch:

1. Shallow-fetch fallback cannot resolve a non-tip pin (lines 32-33). The clone at line 26 uses --filter=blob:none and clones only the default branch, which is not amd/deepseek_v4. The subshell then runs:

git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
    || git fetch --depth=1 origin amd/deepseek_v4

The fallback retrieves a single commit — the current tip of amd/deepseek_v4. The very next line, git checkout --force "$SGL_PR_SHA", then fails with reference is not a tree whenever SGL_PR_SHA is not that tip. The script's own comment "Bump SGL_PR_SHA when the branch advances" explicitly anticipates that the pinned SHA lags the branch tip, so this is not a hypothetical case. The fetch-by-SHA can fail for ordinary reasons (transient GitHub error, unauthenticated rate limit, the SHA being GC'd after a force-push), and the fallback then cannot produce the requested object.

2. No set -eo pipefail (line 1 is the shebang, line 3 sources benchmark_lib.sh with no shell-options line in between). The sister script benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh:2 sets set -eo pipefail, and the sourced benchmark_lib.sh does not enable errexit either. Without it, every failure inside the subshell on lines 30-41 is swallowed:

  • the failed git checkout --force "$SGL_PR_SHA" does not abort the subshell;
  • test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" at line 35 — the explicit pin guard — only sets $?, it does not exit;
  • pip install --no-build-isolation --no-deps --force-reinstall -e python/ at line 40 then runs unconditionally, against whatever the working tree happens to contain, and its exit code becomes the subshell's, masking every earlier failure;
  • the parent script also lacks errexit, so even a non-zero subshell exit would be ignored.

The python3 -c "import sglang; print(...)" line on 43 only prints the version, it does not assert anything, and sglang.launch_server then runs against the wrong sglang.

Step-by-step proof

Assume SGL_PR_SHA = 18afbf15… and the branch amd/deepseek_v4 has since advanced to a different SHA Y. The fetch-by-SHA call fails (transient 503 from github.com):

  1. Line 25-27: clone succeeds, default branch only (no amd/deepseek_v4, no 18afbf15…).
  2. Line 32: git fetch --depth=1 origin 18afbf15… → fails (stderr suppressed via 2>/dev/null), exit 128.
  3. Line 33: git fetch --depth=1 origin amd/deepseek_v4 → succeeds, downloads only commit Y.
  4. Line 34: git checkout --force 18afbf15… → fails: fatal: reference is not a tree: 18afbf15…. Without set -e, subshell continues; HEAD is still whatever the initial clone left (default branch tip).
  5. Line 35: test "$(git rev-parse HEAD)" = "18afbf15…"$? set non-zero. Subshell continues.
  6. Line 40: pip install … -e python/ → runs against the default-branch tree (not the pin), succeeds, exit 0.
  7. Subshell exits 0 (pip's status). Parent continues.
  8. Line 43: import sglang; print(...) → prints the wrong version.
  9. sglang.launch_server runs with the wrong sglang and benchmark numbers are reported under the assumed pin.

A second realistic trigger: the by-SHA fetch succeeds on first run, then on a later invocation in the same /tmp/sglang-amd-dsv4 checkout the network is flaky. The if [ ! -d "$SGL_PR_DIR/.git" ] guard at line 25 means the clone is reused; the fallback then leaves the local repo with only commit Y; the same silent cascade follows.

Fix

Minimum: add set -eo pipefail near the top of the script (matching the vllm sister script). That alone causes the failed checkout / mismatched-HEAD test to abort the subshell and the script.

Additionally, make the fallback actually able to produce non-tip SHAs — drop --depth=1 from the branch fallback, or use a deeper fetch (e.g. --depth=50) so older pinned SHAs remain reachable.

Switch the FP8 entry to rocm/sgl-dev:rocm720-deepseek-v4-mi35x (Apr 28
ROCm 7.2.0 image), aligned with the amd/deepseek_v4 SHA pinned by the
overlay in dsv4_fp8_mi355x.sh.

Add dsv4_fp4_mi355x_sglang.sh + dsv4-fp4-mi355x-sglang config entry,
mirroring the branch's run_dsv4.sh FP4 path: SGLANG_DSV4_FP4_EXPERTS=True,
SGLANG_FORCE_TRITON_MOE_FP8=0, model deepseek-ai/DeepSeek-V4-Pro.
Match the env vars in dsv4_fp{8,4}_mi355x.sh to the amd/deepseek_v4
branch's python/run_dsv4.sh at SHA 18afbf15:

- Add SGLANG_REASONING_EFFORT=max to both scripts (was missing).
- Set SGLANG_FORCE_TRITON_MOE_FP8=0 in the FP8 script. The FP4 Models
  integration commit (sgl-project/sglang@33de1e64, Apr 28) made this the
  gating switch for aiter MoE dispatch on both paths. Our FP8 script
  predates the flip (#1134) and was still pinning 1, which forced the
  triton MoE fallback.

Switch dsv4-fp8-mi355x-sglang to deepseek-ai/DeepSeek-V4-Pro to match the
canonical checkpoint loaded by every other DSv4 entry and by run_dsv4.sh.

Drop the duplicate dsv4-fp8-mi355x-sglang perf-changelog stub left in
the worktree (#1160 entry already exists earlier in the file); replace
with a real entry covering this PR's image bump, env-var alignment,
model swap, and new FP4 variant.
rocm720-deepseek-v4-mi35x crashes on global_tokens.fill_(0) inside
_dp_gather_via_all_reduce during forward_idle:

  torch.AcceleratorError: HIP error: invalid configuration argument
  File ".../layers/dp_attention.py", line 454, in _dp_gather_via_all_reduce
    global_tokens.fill_(0)

global_tokens comes from get_global_dp_buffer(), which allocates inside
use_symmetric_memory(). The SGLANG_USE_ROCM700A workaround only swaps the
cuda-graph default padding mode to SUM_LEN; with --disable-cuda-graph we
take the eager path and still allocate via symmetric memory, hitting the
RCCL symmetric-memory bug the workaround was created for.

Switch both dsv4-fp{8,4}-mi355x-sglang entries to
v0.5.10.post1-rocm700-mi35x-20260428 (Apr 28 ROCm 7.0.0a tag) to keep the
sglang overlay perf changes shippable. Revisit rocm720 once the WA is
extended to cover eager mode or RCCL symm-mem lands.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant