ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back#14226
Merged
ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back#14226
Conversation
Contributor
|
Note Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported. |
bc54f3a to
ccdb88a
Compare
added 10 commits
December 3, 2025 16:36
- Update all AMD CI runners from linux-mi300-gpu-* to linux-mi325-gpu-*.test - Update runner labels: 1-gpu, 2-gpu, 4-gpu, 8-gpu configurations - Affects both PR tests and nightly tests - Testing migration from TensorWave cluster to new infrastructure Runner labels: - linux-mi325-gpu-1.test - linux-mi325-gpu-2.test - linux-mi325-gpu-4.test (not yet used) - linux-mi325-gpu-8.test
The XetHub (hf-xet) download protocol fails intermittently with CAS service errors. Disable it for the accuracy-test-2-gpu-amd job to use standard HuggingFace downloads instead, which are more reliable. Error seen: RuntimeError: Data processing error: CAS service error : IO Error: No such file or directory (os error 2) Fix: Set HF_HUB_ENABLE_HF_TRANSFER=0 to disable hf-xet protocol.
There are only 2 tests in the per-commit-8-gpu-amd suite, but the CI was configured to run 3 shards, wasting one 8 GPU runner per CI run.
Remove .test suffix from all MI325 runner labels as we prepare to switch from test runners to production runners.
Remove .test suffix from MI325 runner label in nightly test workflow.
Add a diagnostic test to verify RCCL communication works across 8 GPUs before running the actual test suite. This will help identify runner configuration issues early with detailed debug output. The test performs a simple allreduce operation across all 8 GPUs with NCCL_DEBUG=INFO and RCCL_DEBUG=INFO enabled to capture detailed logs.
The RCCL diagnostic test revealed that /dev/shm is full with stale NCCL/RCCL shared memory files, causing "No space left on device" errors. Add cleanup steps to: 1. Clean host /dev/shm before starting container (--ipc=host shares it) 2. Clean container /dev/shm after starting container 3. Verify /dev/shm space is available This should resolve the RCCL initialization failures on MI325 runners.
The host /dev/shm is only 64 MB on CI runners, but RCCL needs ~85 MB for 8 GPU communication (10.6 MB per GPU). Using --ipc=host forces the container to use the host's tiny 64 MB /dev/shm instead of its own 32 GB allocation. Removing --ipc=host allows the container to use its full 32 GB /dev/shm, fixing the "No space left on device" errors during RCCL initialization. Verified locally that RCCL works with sufficient /dev/shm space.
The DeepSeek V3 tests are generating garbage output on MI325 runners but the RCCL infrastructure is confirmed working (test passes). Revert to MI300 runners to confirm the tests pass there, isolating whether this is an MI325X-specific model compatibility issue. Keep all infrastructure fixes: - RCCL diagnostic test - /dev/shm cleanup - Removed --ipc=host for proper 32GB shm - Reduced shards from 3 to 2
- Remove unnecessary /dev/shm cleanup steps (not needed since --ipc=host was removed) - Temporarily disable 8-GPU test job - Add failing models to nightly test exclusion list: * neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8 (GEMM not supported) * zai-org/GLM-4.5-Air-FP8 (ForwardMetadata unpack error)
f2216a2 to
8504018
Compare
HAIAI
approved these changes
Dec 3, 2025
- Make script executable (chmod +x) - Fix import ordering (add blank line after stdlib imports) - Fix code formatting (remove trailing whitespace, wrap long lines) These changes address the pre-commit hook failures: - check-shebang-scripts-are-executable - isort - black-jupyter
HaiShaw
approved these changes
Dec 3, 2025
tom-jerr
pushed a commit
to tom-jerr/sglang
that referenced
this pull request
Dec 4, 2025
…failed CI's to be added back (sgl-project#14226)
yingluosanqian
pushed a commit
to yingluosanqian/sglang
that referenced
this pull request
Dec 4, 2025
…failed CI's to be added back (sgl-project#14226)
tonyluj
pushed a commit
to openanolis/sglang
that referenced
this pull request
Dec 5, 2025
…failed CI's to be added back (sgl-project#14226)
tonyluj
pushed a commit
to openanolis/sglang
that referenced
this pull request
Dec 5, 2025
…failed CI's to be added back (sgl-project#14226)
sunxxuns
added a commit
to sunxxuns/sglang
that referenced
this pull request
Dec 5, 2025
…failed CI's to be added back (sgl-project#14226)
yuchengz816-bot
pushed a commit
to yuchengz816-bot/sglang
that referenced
this pull request
Dec 8, 2025
…failed CI's to be added back (sgl-project#14226)
Kevin-XiongC
pushed a commit
to novitalabs/sglang
that referenced
this pull request
Dec 9, 2025
…failed CI's to be added back (sgl-project#14226)
tonyluj
pushed a commit
to openanolis/sglang
that referenced
this pull request
Dec 12, 2025
…failed CI's to be added back (sgl-project#14226)
tonyluj
pushed a commit
to openanolis/sglang
that referenced
this pull request
Dec 12, 2025
…failed CI's to be added back (sgl-project#14226)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Migrate AMD CI infrastructure from TensorWave cluster to new MI325 runners due to capacity and allocation changes.
Changes
linux-mi300-gpu-*tolinux-mi325-gpu-*.testpr-test-amd.yml) and nightly tests (nightly-test-amd.yml)linux-mi325-gpu-1.test(1-GPU runners)linux-mi325-gpu-2.test(2-GPU runners)linux-mi325-gpu-8.test(8-GPU runners)Testing Plan
This PR will validate the new MI325 runner infrastructure by running the full AMD CI suite:
Migration Context
Per infrastructure team request, we need to:
Risk Assessment
cc @Reviewer - Please validate that the new MI325 runners work correctly with this test run.