
Conversation

sunxxuns (Collaborator) commented on Nov 23, 2025

Summary

This PR fixes two critical issues causing AMD CI failures:

1. HIP layernorm implementation bug

  • Issue: TypeError: fused_add_rms_norm() missing 2 required positional arguments
  • Root cause: incorrect call signature; the HIP path passed only 4 arguments instead of the required 6
  • Fix: Create output tensors and pass all 6 parameters: (out, input, residual_out, residual, weight, epsilon); see the sketch below
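
A minimal sketch of the corrected call shape, assuming the 6-parameter (out, input, residual_out, residual, weight, epsilon) signature described above; the wrapper name and the kernel handle passed in are illustrative, not the actual layernorm.py code:

```python
import torch

def forward_hip_fixed(fused_add_rms_norm, x, residual, weight, eps):
    # Allocate explicit output tensors instead of relying on in-place updates,
    # then pass all 6 parameters in the order described in this PR:
    # (out, input, residual_out, residual, weight, epsilon).
    out = torch.empty_like(x)
    residual_out = torch.empty_like(residual)
    fused_add_rms_norm(out, x, residual_out, residual, weight, eps)
    return out, residual_out
```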

2. Intermittent PyPI connectivity failures

  • Issue: Random failures with ERROR: No matching distribution found for black>=24.1.0
  • Root cause: Unstable PyPI connectivity on AMD runners
  • Fix: Add retry logic (3 attempts, 5s delay) with fallback to the Aliyun PyPI mirror, and clear the Python bytecode cache before installs; see the sketch below
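
A minimal Python sketch of the retry-with-mirror-fallback pattern; the real fix lives in the shell script amd_ci_install_dependency.sh, and the mirror URL and package list here are examples:

```python
import subprocess
import sys
import time

ALIYUN_MIRROR = "https://mirrors.aliyun.com/pypi/simple/"  # example mirror URL

def pip_install_with_retry(packages, attempts=3, delay=5):
    """Retry pip installs, switching to a PyPI mirror after the first failure."""
    for attempt in range(1, attempts + 1):
        cmd = [sys.executable, "-m", "pip", "install", *packages]
        if attempt >= 2:
            # Fall back to the mirror once the default index has failed.
            cmd += ["--index-url", ALIYUN_MIRROR]
        if subprocess.run(cmd).returncode == 0:
            return True
        time.sleep(delay)
    return False

# e.g. pip_install_with_retry(["black>=24.1.0"])
```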

Test Results

All AMD CI tests now pass consistently ✅

  • All unit-test-backend-*-gpu-amd jobs: PASS
  • All accuracy-test-*-gpu-amd jobs: PASS
  • All performance-test-*-gpu-amd jobs: PASS
  • All stage-a-test-*-amd jobs: PASS

Run: https://github.com/sgl-project/sglang/actions/runs/19693108888

gemini-code-assist (Contributor) commented:

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

github-actions bot added the amd and documentation labels on Nov 23, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch from a75f571 to c984b12 on November 25, 2025 06:55
sunxxuns changed the title from "amd ci: Use generic AMD runner labels with timeouts to prevent queue bloc…" to "ci: Fix AMD HIP layernorm implementation and enable AMD CI" on Nov 25, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 2 times, most recently from 0bcc7dc to 7e4f8ca on November 26, 2025 05:40
sunxxuns changed the title from "ci: Fix AMD HIP layernorm implementation and enable AMD CI" to "fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity" on Nov 26, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 3 times, most recently from 56c5f51 to e73f469 on November 26, 2025 22:17
github-actions bot added the quant, Multi-modal, deepseek, npu, diffusion, and model-gateway labels on Nov 26, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 2 times, most recently from e9eae15 to 0d62bf6 on November 27, 2025 01:48
root added 2 commits on November 26, 2025 21:42:
Two critical AMD CI fixes:

1. Fix fused_add_rms_norm signature in forward_hip (layernorm.py)
   - Corrected call to match 6-param signature (out, input, residual_out, residual, weight, epsilon)
   - Fixes TypeError: missing 2 required positional arguments

2. Add retry logic and PyPI mirror fallback (amd_ci_install_dependency.sh)
   - Retry pip installs up to 3 times with 5s delays
   - Falls back to Aliyun PyPI mirror on 2nd attempt
   - Clear Python bytecode cache before installs
   - Fixes intermittent PyPI connectivity errors

All AMD CI tests now pass consistently.
The chown command can fail when multiple CI jobs access the shared
pip cache simultaneously. Redirect stderr and continue on errors
to prevent flaky failures from race conditions on temporary cache files.
sunxxuns force-pushed the ci/add-runner-fallback-support branch from 0d62bf6 to 3d34547 on November 27, 2025 02:42
sunxxuns enabled auto-merge (squash) on November 27, 2025 02:43
hnyls2002 disabled auto-merge on November 27, 2025 03:30
hnyls2002 merged commit 5443db8 into sgl-project:main on Nov 27, 2025 (55 of 78 checks passed)
cboillot commented Dec 3, 2025

Hi @sunxxuns,

When I try this with sglang + vllm (recompiled 0.11.1, but it's true for other vllm versions too as far as I can check), I get the opposite error:
.venv/lib/python3.11/site-packages/sglang/srt/layers/layernorm.py", line 170, in forward_hip
fused_add_rms_norm(
TypeError: fused_add_rms_norm() takes 4 positional arguments but 6 were given
It seems that the function refers to:
https://github.com/vllm-project/vllm/blob/15b1511a15dfb1d56048847da755213632c07b29/vllm/_custom_ops.py#L334
Which vllm version does your 6-parameter fused_add_rms_norm refer to?

Thanks

harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025