ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back by sunxxuns · Pull Request #14226 · sgl-project/sglang

sunxxuns · 2025-12-01T14:29:50Z

Summary

Migrate AMD CI infrastructure from TensorWave cluster to new MI325 runners due to capacity and allocation changes.

Changes

- Update all AMD CI runner labels from linux-mi300-gpu-* to linux-mi325-gpu-*.test
- Affects both PR tests (pr-test-amd.yml) and nightly tests (nightly-test-amd.yml)
- Updated runner configurations:
  - linux-mi325-gpu-1.test (1-GPU runners)
  - linux-mi325-gpu-2.test (2-GPU runners)
  - linux-mi325-gpu-8.test (8-GPU runners)
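
The label change above is a mechanical rewrite of each runner label. A minimal Python sketch of that mapping, assuming the old labels follow the `linux-mi300-gpu-<N>` pattern; the helper name `migrate_label` is hypothetical and not part of this PR:

```python
import re

# Matches old runner labels such as "linux-mi300-gpu-2"; the trailing number is the GPU count.
_OLD_LABEL = re.compile(r"^linux-mi300-gpu-(\d+)$")


def migrate_label(label: str, suffix: str = ".test") -> str:
    """Map an MI300 runner label to its MI325 counterpart; leave other labels unchanged."""
    match = _OLD_LABEL.match(label)
    if match is None:
        return label
    return f"linux-mi325-gpu-{match.group(1)}{suffix}"
```

Passing `suffix=""` gives the label form without the `.test` suffix.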

Testing Plan

This PR will validate the new MI325 runner infrastructure by running the full AMD CI suite:

✅ sgl-kernel unit tests (1-GPU)
✅ Stage-A tests (1-GPU)
✅ Backend unit tests (1-GPU, 2-GPU, 8-GPU)
✅ Performance tests (1-GPU, 2-GPU)
✅ Accuracy tests (1-GPU, 2-GPU)

Migration Context

Per infrastructure team request, we need to:

- Test and validate new MI325 runners
- Migrate off old TensorWave cluster
- Confirm before teardown of old runners

Risk Assessment

- Low risk: Only changes runner labels, no code changes
- Rollback: Can revert to old runners if issues found
- Impact: AMD CI pipeline on new infrastructure

cc @Reviewer - Please validate that the new MI325 runners work correctly with this test run.

gemini-code-assist · 2025-12-01T14:29:55Z

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

- Update all AMD CI runners from linux-mi300-gpu-* to linux-mi325-gpu-*.test
- Update runner labels: 1-gpu, 2-gpu, 4-gpu, 8-gpu configurations
- Affects both PR tests and nightly tests
- Testing migration from TensorWave cluster to new infrastructure

Runner labels:
- linux-mi325-gpu-1.test
- linux-mi325-gpu-2.test
- linux-mi325-gpu-4.test (not yet used)
- linux-mi325-gpu-8.test

The XetHub (hf-xet) download protocol fails intermittently with CAS service errors. Disable it for the accuracy-test-2-gpu-amd job to use standard HuggingFace downloads instead, which are more reliable.

Error seen: RuntimeError: Data processing error: CAS service error : IO Error: No such file or directory (os error 2)

Fix: Set HF_HUB_ENABLE_HF_TRANSFER=0 to disable hf-xet protocol.
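
As a rough illustration of the fix, the sketch below builds an environment with the variable named in this commit message, ready to pass to a child process. The helper name `download_env` is hypothetical. To the best of my knowledge huggingface_hub also documents a separate `HF_HUB_DISABLE_XET` switch aimed at the Xet backend specifically, so it may be worth checking which variable takes effect in the installed version.

```python
import os


def download_env(base_env=None):
    """Return a copy of the environment with the setting from the commit message applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable named in the commit message; "0" switches the feature it controls off.
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
    return env
```

The result can be handed to `subprocess.run(..., env=download_env())` so that only the test process sees the override.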

There are only 2 tests in the per-commit-8-gpu-amd suite, but the CI was configured to run 3 shards, wasting one 8 GPU runner per CI run.
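
To make the waste concrete, here is a small sketch of how 2 tests spread over 3 shards leave one shard empty. It assumes a round-robin split, which may differ from the scheme the CI scripts actually use, and the test file names are placeholders.

```python
def partition(tests, num_shards):
    """Round-robin split of tests into num_shards lists (one list per CI runner)."""
    return [tests[i::num_shards] for i in range(num_shards)]


def idle_shards(tests, num_shards):
    """Count shards that receive no tests and so occupy a runner for nothing."""
    return sum(1 for shard in partition(tests, num_shards) if not shard)


# Placeholder names standing in for the two tests in the suite.
suite = ["test_a.py", "test_b.py"]
```

With this split, `idle_shards(suite, 3)` reports one empty shard, while `idle_shards(suite, 2)` reports none.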

Remove .test suffix from all MI325 runner labels as we prepare to switch from test runners to production runners.

Remove .test suffix from MI325 runner label in nightly test workflow.

Add a diagnostic test to verify RCCL communication works across 8 GPUs before running the actual test suite. This will help identify runner configuration issues early with detailed debug output. The test performs a simple allreduce operation across all 8 GPUs with NCCL_DEBUG=INFO and RCCL_DEBUG=INFO enabled to capture detailed logs.
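
The script added by this commit is not shown on this page. The following is only a sketch of what such an allreduce check could look like with `torch.distributed`, assuming a ROCm build of PyTorch (where the `nccl` backend is backed by RCCL) and a single host with 8 GPUs. The function names, the rendezvous address, and the port are illustrative choices, and only `NCCL_DEBUG` is set here.

```python
import os


def expected_sum(world_size: int) -> int:
    """Sum of 1..world_size: the value every rank should hold after the allreduce."""
    return world_size * (world_size + 1) // 2


def worker(rank: int, world_size: int) -> None:
    # Imported lazily so expected_sum() stays usable on machines without PyTorch.
    import torch
    import torch.distributed as dist

    # Single-host rendezvous; address and port are arbitrary illustrative values.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    try:
        # Each rank contributes rank + 1, so the reduced value is 1 + 2 + ... + world_size.
        value = torch.tensor([float(rank + 1)], device="cuda")
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        assert int(value.item()) == expected_sum(world_size), f"rank {rank}: {value.item()}"
    finally:
        dist.destroy_process_group()


def main(world_size: int = 8) -> None:
    import torch.multiprocessing as mp

    os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose communicator logs
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Calling `main()` on the runner starts one process per GPU; a failure during communicator setup surfaces before the real test suite starts.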

The RCCL diagnostic test revealed that /dev/shm is full of stale NCCL/RCCL shared memory files, causing "No space left on device" errors. Add cleanup steps to:

1. Clean host /dev/shm before starting the container (--ipc=host shares it)
2. Clean container /dev/shm after starting the container
3. Verify /dev/shm space is available

This should resolve the RCCL initialization failures on MI325 runners.

The host /dev/shm is only 64 MB on CI runners, but RCCL needs ~85 MB for 8 GPU communication (10.6 MB per GPU). Using --ipc=host forces the container to use the host's tiny 64 MB /dev/shm instead of its own 32 GB allocation. Removing --ipc=host allows the container to use its full 32 GB /dev/shm, fixing the "No space left on device" errors during RCCL initialization. Verified locally that RCCL works with sufficient /dev/shm space.
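
The arithmetic behind this diagnosis can be checked with a short sketch. The 10.6 MB per-GPU figure comes from the commit message above; the function names are hypothetical.

```python
import shutil


def rccl_shm_needed_mb(num_gpus: int, per_gpu_mb: float = 10.6) -> float:
    """Shared memory RCCL is reported to need, using the per-GPU figure from the commit."""
    return num_gpus * per_gpu_mb


def shm_is_sufficient(shm_total_mb: float, num_gpus: int) -> bool:
    """True when a /dev/shm of the given size covers the estimated requirement."""
    return shm_total_mb >= rccl_shm_needed_mb(num_gpus)


def shm_total_mb(path: str = "/dev/shm") -> float:
    """Total size of the filesystem mounted at path, in MB."""
    return shutil.disk_usage(path).total / (1024 * 1024)
```

For 8 GPUs the estimate is 84.8 MB, which exceeds a 64 MB /dev/shm but fits easily in 32 GB. On a runner, `shm_is_sufficient(shm_total_mb(), 8)` could serve as an early sanity check.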

The DeepSeek V3 tests are generating garbage output on MI325 runners, but the RCCL infrastructure is confirmed working (test passes). Revert to MI300 runners to confirm the tests pass there, isolating whether this is an MI325X-specific model compatibility issue.

Keep all infrastructure fixes:
- RCCL diagnostic test
- /dev/shm cleanup
- Removed --ipc=host for proper 32GB shm
- Reduced shards from 3 to 2

- Remove unnecessary /dev/shm cleanup steps (not needed since --ipc=host was removed)
- Temporarily disable 8-GPU test job
- Add failing models to nightly test exclusion list:
  * neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8 (GEMM not supported)
  * zai-org/GLM-4.5-Air-FP8 (ForwardMetadata unpack error)
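
An exclusion list of this kind can be sketched as a simple filter. The two model names and failure reasons are taken from the commit message above; the data structure and the function name `models_to_test` are assumptions, not the nightly test's actual code.

```python
# Models the commit message lists as failing on the new runners, with the stated reason.
EXCLUDED_MODELS = {
    "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8": "GEMM not supported",
    "zai-org/GLM-4.5-Air-FP8": "ForwardMetadata unpack error",
}


def models_to_test(candidates):
    """Drop excluded models while preserving the original order."""
    return [model for model in candidates if model not in EXCLUDED_MODELS]
```

Keeping the reason next to each name makes it easier to re-enable a model once the underlying failure is fixed.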

HAIAI

lint

- Make script executable (chmod +x)
- Fix import ordering (add blank line after stdlib imports)
- Fix code formatting (remove trailing whitespace, wrap long lines)

These changes address the pre-commit hook failures:
- check-shebang-scripts-are-executable
- isort
- black-jupyter

…failed CI's to be added back (sgl-project#14226)

sunxxuns requested review from Fridge003, Kangyan-Zhou, ispobock and merrymercy as code owners December 1, 2025 14:29

sunxxuns added the run-ci label Dec 1, 2025

github-actions bot added the amd label Dec 1, 2025

sunxxuns added amd run-ci and removed amd run-ci labels Dec 1, 2025

sunxxuns force-pushed the test-mi325-runners branch 4 times, most recently from bc54f3a to ccdb88a Compare December 2, 2025 15:50

root added 10 commits December 3, 2025 16:36

ci: Reduce AMD 8 GPU test shards from 3 to 2

f8291a2

There are only 2 tests in the per-commit-8-gpu-amd suite, but the CI was configured to run 3 shards, wasting one 8 GPU runner per CI run.

ci: Remove .test suffix from MI325 runner labels

6188742

Remove .test suffix from all MI325 runner labels as we prepare to switch from test runners to production runners.

ci: Remove .test suffix from nightly AMD runner label

4987629

Remove .test suffix from MI325 runner label in nightly test workflow.

sunxxuns force-pushed the test-mi325-runners branch from f2216a2 to 8504018 Compare December 3, 2025 16:38

HAIAI approved these changes Dec 3, 2025


HAIAI suggested changes Dec 3, 2025


sunxxuns changed the title ~~ci: Migrate AMD workflows to new MI325 runners~~ ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back Dec 3, 2025

Merge branch 'main' into test-mi325-runners

81c3147

HaiShaw approved these changes Dec 3, 2025


HaiShaw merged commit 5bbd83a into main Dec 3, 2025
28 of 67 checks passed

HaiShaw deleted the test-mi325-runners branch December 3, 2025 19:33

tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

d3fc2b9

…failed CI's to be added back (sgl-project#14226)

yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

8ea407a

…failed CI's to be added back (sgl-project#14226)

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

e48a320

…failed CI's to be added back (sgl-project#14226)

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

73eba04

…failed CI's to be added back (sgl-project#14226)

sunxxuns added a commit to sunxxuns/sglang that referenced this pull request Dec 5, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

4647ab1

…failed CI's to be added back (sgl-project#14226)

yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

278515c

…failed CI's to be added back (sgl-project#14226)

Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

470fcc4

…failed CI's to be added back (sgl-project#14226)

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

a179093

…failed CI's to be added back (sgl-project#14226)

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025

ci: Migrate AMD workflows to new MI325 runners; temporarily disabled …

c92d0e3

…failed CI's to be added back (sgl-project#14226)

HaiShaw merged 12 commits into main from test-mi325-runners
