fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity #13814
Two critical AMD CI fixes:

1. Fix fused_add_rms_norm signature in forward_hip (layernorm.py)
   - Corrected call to match 6-param signature (out, input, residual_out, residual, weight, epsilon)
   - Fixes TypeError: missing 2 required positional arguments

2. Add retry logic and PyPI mirror fallback (amd_ci_install_dependency.sh)
   - Retry pip installs up to 3 times with 5s delays
   - Falls back to Aliyun PyPI mirror on 2nd attempt
   - Clear Python bytecode cache before installs
   - Fixes intermittent PyPI connectivity errors

All AMD CI tests now pass consistently.
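The retry-and-fallback behaviour described in item 2 can be sketched roughly as follows. The actual change lives in a shell script (amd_ci_install_dependency.sh); this Python version only illustrates the control flow. The function names, the injectable `run` callback, and the mirror URL are assumptions made for illustration and are not taken from the PR; the sketch also assumes the mirror stays in use for every attempt after the first.

```python
import sys
import time
from typing import Callable, List, Optional, Sequence

# Assumption: a commonly used Aliyun PyPI index URL; the PR text does not quote one.
ALIYUN_INDEX = "https://mirrors.aliyun.com/pypi/simple/"


def build_pip_command(packages: Sequence[str], index_url: Optional[str] = None) -> List[str]:
    """Build a `pip install` command, optionally pointing at an alternate index."""
    cmd = [sys.executable, "-m", "pip", "install", *packages]
    if index_url is not None:
        cmd += ["--index-url", index_url]
    return cmd


def install_with_retry(
    packages: Sequence[str],
    run: Callable[[List[str]], int],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the install up to `attempts` times; return True on the first success.

    `run` executes a command and returns its exit code (hypothetical hook,
    so the logic can be exercised without touching the network).
    """
    for attempt in range(1, attempts + 1):
        # First attempt uses the default index; later attempts use the mirror.
        index_url = ALIYUN_INDEX if attempt >= 2 else None
        if run(build_pip_command(packages, index_url)) == 0:
            return True
        if attempt < attempts:
            sleep(delay_s)  # wait before the next attempt
    return False
```

In real use, `run` could be something like `lambda cmd: subprocess.run(cmd).returncode`; passing it in as a parameter keeps the retry policy separate from the process call.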
The chown command can fail when multiple CI jobs access the shared pip cache simultaneously. Redirect stderr and continue on errors to prevent flaky failures from race conditions on temporary cache files.
Hi @sunxxuns, when I try this with sglang + vllm (recompiled 0.11.1, but as far as I can check it is also true for other vllm versions), I get the opposite message:
Thanks
Summary
This PR fixes two critical issues causing AMD CI failures:
1. HIP layernorm implementation bug
   - TypeError: fused_add_rms_norm() missing 2 required positional arguments
   - Expected 6-param signature: (out, input, residual_out, residual, weight, epsilon)

2. Intermittent PyPI connectivity failures
   - ERROR: No matching distribution found for black>=24.1.0

Test Results
All AMD CI tests now pass consistently ✅
- unit-test-backend-*-gpu-amd jobs: PASS
- accuracy-test-*-gpu-amd jobs: PASS
- performance-test-*-gpu-amd jobs: PASS
- stage-a-test-*-amd jobs: PASS

Run: https://github.com/sgl-project/sglang/actions/runs/19693108888
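
The signature mismatch from item 1 of the summary can be reproduced in isolation. The sketch below uses a hypothetical pure-Python stand-in that shares only the parameter order quoted in the PR (out, input, residual_out, residual, weight, epsilon). It is not the HIP kernel, the arithmetic is an assumed add-then-RMS-normalize, and the four-argument call is a guess at what the old call site looked like.

```python
import math
from typing import List


def fused_add_rms_norm(
    out: List[float],
    input: List[float],  # name shadows the builtin; kept to mirror the quoted signature
    residual_out: List[float],
    residual: List[float],
    weight: List[float],
    epsilon: float,
) -> None:
    """Hypothetical stand-in: writes input + residual into residual_out,
    then the RMS-normalized, weighted result into out."""
    for i in range(len(input)):
        residual_out[i] = input[i] + residual[i]
    mean_sq = sum(v * v for v in residual_out) / len(residual_out)
    scale = 1.0 / math.sqrt(mean_sq + epsilon)
    for i in range(len(out)):
        out[i] = residual_out[i] * scale * weight[i]


x = [1.0, 2.0]
residual = [0.5, -0.5]
weight = [1.0, 1.0]

# Assumed old call shape: four arguments against a six-parameter function.
try:
    fused_add_rms_norm(x, residual, weight, 1e-6)
    message = ""
except TypeError as exc:
    message = str(exc)  # reports 2 missing required positional arguments

# Corrected call shape: separate output buffers are passed explicitly.
out = [0.0, 0.0]
residual_out = [0.0, 0.0]
fused_add_rms_norm(out, x, residual_out, residual, weight, 1e-6)
```

With the six-argument form the inputs are left untouched and the results land in `out` and `residual_out`, which is why a call written for a shorter in-place signature fails before any computation runs.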