
Conversation

sunxxuns (Collaborator) commented on Nov 23, 2025

Summary

This PR fixes two critical issues causing AMD CI failures:

1. HIP layernorm implementation bug

  • Issue: TypeError: fused_add_rms_norm() missing 2 required positional arguments
  • Root cause: incorrect call signature; the HIP path passed only 4 arguments instead of the required 6
  • Fix: Create output tensors and pass all 6 parameters: (out, input, residual_out, residual, weight, epsilon); see the sketch below
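
A minimal sketch of the corrected call shape, assuming the 6-parameter (out, input, residual_out, residual, weight, epsilon) signature described above; the wrapper name and the kernel handle passed in are illustrative, not the actual layernorm.py code:

```python
import torch

def forward_hip_fixed(fused_add_rms_norm, x, residual, weight, eps):
    # Allocate explicit output tensors instead of relying on in-place updates,
    # then pass all 6 parameters in the order described in this PR:
    # (out, input, residual_out, residual, weight, epsilon).
    out = torch.empty_like(x)
    residual_out = torch.empty_like(residual)
    fused_add_rms_norm(out, x, residual_out, residual, weight, eps)
    return out, residual_out
```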

2. Intermittent PyPI connectivity failures

  • Issue: Random failures with ERROR: No matching distribution found for black>=24.1.0
  • Root cause: Unstable PyPI connectivity on AMD runners
  • Fix: Add retry logic (3 attempts, 5s delay) with fallback to the Aliyun PyPI mirror, and clear the Python bytecode cache before installs; see the sketch below
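
A minimal Python sketch of the retry-with-mirror-fallback pattern; the real fix lives in the shell script amd_ci_install_dependency.sh, and the mirror URL and package list here are examples:

```python
import subprocess
import sys
import time

ALIYUN_MIRROR = "https://mirrors.aliyun.com/pypi/simple/"  # example mirror URL

def pip_install_with_retry(packages, attempts=3, delay=5):
    """Retry pip installs, switching to a PyPI mirror after the first failure."""
    for attempt in range(1, attempts + 1):
        cmd = [sys.executable, "-m", "pip", "install", *packages]
        if attempt >= 2:
            # Fall back to the mirror once the default index has failed.
            cmd += ["--index-url", ALIYUN_MIRROR]
        if subprocess.run(cmd).returncode == 0:
            return True
        time.sleep(delay)
    return False

# e.g. pip_install_with_retry(["black>=24.1.0"])
```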

Test Results

All AMD CI tests now pass consistently ✅

  • All unit-test-backend-*-gpu-amd jobs: PASS
  • All accuracy-test-*-gpu-amd jobs: PASS
  • All performance-test-*-gpu-amd jobs: PASS
  • All stage-a-test-*-amd jobs: PASS

Run: https://github.com/sgl-project/sglang/actions/runs/19693108888

gemini-code-assist (Contributor) commented:

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

github-actions bot added the amd and documentation labels on Nov 23, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch from a75f571 to c984b12 on November 25, 2025 06:55
sunxxuns changed the title from "amd ci: Use generic AMD runner labels with timeouts to prevent queue bloc…" to "ci: Fix AMD HIP layernorm implementation and enable AMD CI" on Nov 25, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 2 times, most recently from 0bcc7dc to 7e4f8ca on November 26, 2025 05:40
sunxxuns changed the title from "ci: Fix AMD HIP layernorm implementation and enable AMD CI" to "fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity" on Nov 26, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 3 times, most recently from 56c5f51 to e73f469 on November 26, 2025 22:17
github-actions bot added the quant, Multi-modal, deepseek, npu, diffusion, and model-gateway labels on Nov 26, 2025
sunxxuns force-pushed the ci/add-runner-fallback-support branch 2 times, most recently from e9eae15 to 0d62bf6 on November 27, 2025 01:48
root added 2 commits on November 26, 2025 21:42:
Two critical AMD CI fixes:

1. Fix fused_add_rms_norm signature in forward_hip (layernorm.py)
   - Corrected call to match 6-param signature (out, input, residual_out, residual, weight, epsilon)
   - Fixes TypeError: missing 2 required positional arguments

2. Add retry logic and PyPI mirror fallback (amd_ci_install_dependency.sh)
   - Retry pip installs up to 3 times with 5s delays
   - Falls back to Aliyun PyPI mirror on 2nd attempt
   - Clear Python bytecode cache before installs
   - Fixes intermittent PyPI connectivity errors

All AMD CI tests now pass consistently.
The chown command can fail when multiple CI jobs access the shared
pip cache simultaneously. Redirect stderr and continue on errors
to prevent flaky failures from race conditions on temporary cache files.
sunxxuns force-pushed the ci/add-runner-fallback-support branch from 0d62bf6 to 3d34547 on November 27, 2025 02:42
sunxxuns enabled auto-merge (squash) on November 27, 2025 02:43
hnyls2002 disabled auto-merge on November 27, 2025 03:30
hnyls2002 merged commit 5443db8 into sgl-project:main on Nov 27, 2025 (55 of 78 checks passed)
cboillot commented Dec 3, 2025

Hi @sunxxuns,

When I try this with sglang + vllm (recompiled 0.11.1, but it's true for other vllm versions too as far as I can check), I get the opposite error:
.venv/lib/python3.11/site-packages/sglang/srt/layers/layernorm.py", line 170, in forward_hip
fused_add_rms_norm(
TypeError: fused_add_rms_norm() takes 4 positional arguments but 6 were given
It seems that the function refers to:
https://github.com/vllm-project/vllm/blob/15b1511a15dfb1d56048847da755213632c07b29/vllm/_custom_ops.py#L334
Which vllm version does your 6-parameter fused_add_rms_norm refer to?

Thanks

harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025