[6/N] (Elastic EP) Recover failed ranks #15771
UNIDY2002 wants to merge 4 commits into sgl-project:main
Conversation
Skip EPLB rebalance after rank recovery and directly sync expert weights to recovered ranks, instead of using EPLB, which causes asymmetric P2P operations due to incorrect old_expert_location_metadata for recovered ranks. The root cause is that ExpertLocationUpdater relies on both old_expert_location_metadata and new_expert_location_metadata. For a recovered rank, the old metadata is stale, leading to wrong P2P operations being computed (e.g., rank 6 expects to send to rank 7, but rank 7 doesn't post a corresponding irecv op). Instead of modifying expert_location_updater.py, the correct approach is to skip EPLB after 'EPLB due to rank faults' and let the expert weights be synced through normal Mooncake EP operation. See: sgl-project#15771
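The asymmetry described above can be illustrated with a small sketch. The function and dict names below are hypothetical stand-ins, not the real SGLang API: each rank derives its send/recv plan from an (old, new) expert-location mapping, so a recovered rank holding stale old metadata computes no irecv to match its peer's isend.

```python
def p2p_plan(rank, old_loc, new_loc):
    """Return (sends, recvs) this rank computes for an expert relocation.

    old_loc/new_loc map expert_id -> rank holding it before/after.
    """
    sends = [(e, new_loc[e]) for e, r in old_loc.items()
             if r == rank and new_loc[e] != rank]
    recvs = [(e, old_loc[e]) for e, r in new_loc.items()
             if r == rank and old_loc[e] != rank]
    return sends, recvs

# Ground truth: expert 0 moves from rank 6 to rank 7.
old_true = {0: 6}
new = {0: 7}

sends6, _ = p2p_plan(6, old_true, new)  # rank 6 plans an isend to rank 7

# The recovered rank 7 rebuilt stale metadata claiming it already holds
# expert 0, so it computes no matching irecv -> rank 6's isend would hang.
old_stale = {0: 7}
_, recvs7 = p2p_plan(7, old_stale, new)

print(sends6)  # [(0, 7)]
print(recvs7)  # []
```

Skipping EPLB entirely for recovered ranks sidesteps this mismatch, since no plan is derived from the stale metadata at all.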
/gemini review
Code Review
This pull request introduces support for elastic Expert Parallel (EP) rank recovery and rejoining, specifically for the Mooncake backend. Key changes include the addition of an --elastic-ep-rejoin flag, implementation of rank recovery and rejoining logic in elastic_ep.py, and the integration of recovery checks within the ModelRunner forward pass. The review feedback highlights several critical issues regarding potential deadlocks caused by non-deterministic polling and inconsistent group iteration across ranks. Additionally, there are recommendations to optimize performance in the inference critical path and to ensure that metadata broadcasting safely handles scenarios where the default source rank is the one being recovered.
- Sort groups by unique_name in _iter_live_parallel_groups to ensure consistent iteration order across ranks during collective recovery
- Add early return in maybe_recover_ep_ranks when all ranks are healthy to avoid unnecessary CPU tensor copy overhead
- Add comment explaining Mooncake consistency guarantee for recovery decision synchronization
- Add comment in broadcast_global_expert_location_metadata explaining caller responsibility for ensuring src_rank is healthy
- Mark background thread as daemon in data_parallel_controller.py
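The ordering fix in the first item can be sketched as follows. The `Group` class and group names here are illustrative stand-ins for the real parallel-group objects; the point is that sorting by `unique_name` makes every rank issue collectives over groups in the same sequence, even if the dicts were populated in different orders.

```python
class Group:
    def __init__(self, unique_name, alive=True):
        self.unique_name = unique_name
        self.alive = alive

def iter_live_parallel_groups(groups):
    # Sorting by unique_name yields a deterministic order on every rank,
    # preventing mismatched collectives during recovery.
    for g in sorted(groups.values(), key=lambda g: g.unique_name):
        if g.alive:
            yield g

# Two ranks that built their group dicts in different insertion orders:
groups_rank_a = {"b": Group("tp"), "a": Group("ep"), "c": Group("pp", alive=False)}
groups_rank_b = {"a": Group("ep"), "c": Group("pp", alive=False), "b": Group("tp")}

order_a = [g.unique_name for g in iter_live_parallel_groups(groups_rank_a)]
order_b = [g.unique_name for g in iter_live_parallel_groups(groups_rank_b)]
print(order_a == order_b)  # True
```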
UNIDY2002 left a comment
Response to Review Comments
Thank you for the detailed review. I've addressed the valid concerns and added explanations for the others.
Comment 1 (Critical) - Recovery synchronization
This is not an issue. The Mooncake EP backend's internal semantics guarantee that all ranks observe consistent peer readiness state through its own synchronization mechanisms. The polling appears local but is actually coordinated by the library. I've added a comment in the code explaining this: python/sglang/srt/model_executor/model_runner.py:1445
Comment 2 (High) - Dictionary iteration order
Fixed: _iter_live_parallel_groups now sorts groups by unique_name to ensure consistent iteration order across ranks during collective operations. See python/sglang/srt/elastic_ep/elastic_ep.py:106-111.
Comment 3 (High) - CPU copy overhead
Fixed: Added early return when active_ranks.all() and active_ranks_cpu.all() to avoid unnecessary tensor copies in the healthy case. See python/sglang/srt/model_executor/model_runner.py:1435-1436.
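A hedged sketch of that fast path, with a plain Python list standing in for the GPU tensor (the real code uses torch tensors and `.copy_()`; the function and callback names are illustrative):

```python
def maybe_recover_ep_ranks(active_ranks, active_ranks_cpu, recover_fn):
    # Fast path: all ranks healthy on both the device-side and cached
    # host-side views -> skip the copy and recovery logic entirely.
    if all(active_ranks) and all(active_ranks_cpu):
        return False  # nothing to recover
    active_ranks_cpu[:] = active_ranks  # sync host copy (stands in for .copy_())
    recover_fn([i for i, ok in enumerate(active_ranks_cpu) if not ok])
    return True

recovered = []
print(maybe_recover_ep_ranks([True, True], [True, True], recovered.extend))   # False
print(maybe_recover_ep_ranks([True, False], [True, True], recovered.extend))  # True
print(recovered)  # [1]
```

In the common all-healthy case the function returns before touching the host copy, keeping the per-iteration overhead in the forward pass minimal.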
Comment 4 (Medium) - src_rank=0 safety
The function signature keeps the default for backward compatibility, but I've added a docstring comment explaining the caller's responsibility to ensure the source rank is healthy. In recovery scenarios, this function is called after all ranks have recovered.
Comment 5 (Medium) - Daemon thread
Fixed: Added daemon=True to the background thread. See python/sglang/srt/managers/data_parallel_controller.py:356.
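A minimal illustration of the daemon-thread fix: a background poller created with `daemon=True` no longer blocks interpreter shutdown if it is still running at exit. The poller body here is a stand-in for the real controller loop.

```python
import threading
import time

stop = threading.Event()

def poll_recovered_ranks():
    # Stand-in for the controller's background polling of peer readiness.
    while not stop.is_set():
        time.sleep(0.01)

t = threading.Thread(target=poll_recovered_ranks, daemon=True)
t.start()
print(t.daemon)  # True

stop.set()
t.join(timeout=1)
print(t.is_alive())  # False
```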
Comment 6 (Medium) - ranks[0] for broadcast
No change needed. At this point in the recovery flow, all ranks (including rank 0) have successfully recovered, so using ranks[0] as the source is correct. Changing this could cause inconsistency with tp_worker's broadcast_pyobj synchronization.
/gemini review
Code Review
This pull request implements rank recovery for elastic Expert Parallelism (EP), enabling relaunched ranks to rejoin active process groups. It introduces the --elastic-ep-rejoin argument and adds logic to ModelRunner to detect inactive ranks, poll for their recovery via the Mooncake backend, and synchronize state across the cluster. Feedback identifies a bug in global rank calculation during seed broadcasting and potential issues with metadata synchronization if rank 0 is recovered. Additionally, improvements are suggested to prevent shared memory leaks when re-initializing message queues and to manage the lifecycle of the ZMQ background thread to avoid port binding conflicts.
- Fix critical: use get_world_group().rank for correct global rank calculation
- Fix high: select healthy rank as broadcast source instead of defaulting to rank 0
- Fix medium: close existing mq_broadcaster before recreating to prevent shared memory leaks
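The broadcast-source fix in the second item can be sketched like this. The function name is hypothetical; the idea is simply to pick the lowest-numbered healthy rank rather than hard-coding rank 0, which may itself be the rank being recovered.

```python
def choose_broadcast_src(active_ranks):
    """Return the lowest-numbered healthy rank to act as broadcast source."""
    for rank, ok in enumerate(active_ranks):
        if ok:
            return rank
    raise RuntimeError("no healthy rank available as broadcast source")

print(choose_broadcast_src([True, True, True]))   # 0
print(choose_broadcast_src([False, True, True]))  # 1  (rank 0 is recovering)
```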
/gemini review
Code Review
This pull request implements rank recovery and rejoining for elastic Expert Parallel (EP) using the Mooncake backend. It introduces the --elastic-ep-rejoin flag and adds logic to synchronize process groups, broadcast expert location metadata, and reset state managers when a rank is relaunched. Key changes include updates to the distributed parallel state to handle recovered ranks, new recovery helpers in elastic_ep.py, and the integration of a health-check and recovery mechanism within the ModelRunner forward pass. Feedback focuses on a performance bottleneck caused by synchronous device-to-host transfers during the per-iteration health check and a potential logic bug where active rank tensors are not reset after a successful recovery, which could lead to infinite recovery loops.
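The per-iteration health check wired into the forward pass can be sketched as a simple loop: run inference normally each iteration, and once the relaunched process reports ready, fold it back into the process group. All names are illustrative; the real readiness check goes through the Mooncake backend.

```python
def serve_with_recovery(ready_checks):
    """ready_checks is an iterable of per-iteration readiness results."""
    events = []
    for ready in ready_checks:
        if ready:
            # Relaunched rank is ready: rejoin it before continuing.
            events.append("join_process_group")
        events.append("inference_iteration")
    return events

# Not ready for two iterations, then the relaunched rank comes up:
log = serve_with_recovery([False, False, True])
print(log)
# ['inference_iteration', 'inference_iteration',
#  'join_process_group', 'inference_iteration']
```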
Motivation
As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput.
The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag `--elastic-ep-rejoin`. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it seamlessly rejoins the existing process group. With this design, disruption to ongoing inference is minimized.

```mermaid
flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C
    U[New process] --> E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D
```

Modifications
- Add `--elastic-ep-rejoin` to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group.

Accuracy Tests
Manual multi-node recovery test:
Node rank 0:
Node rank 1:
Confirm the service is healthy and serving requests.
Terminate the process on the larger-index node (`--node-rank 1`). Relaunch that same node with exactly the same command line, only adding `--elastic-ep-rejoin`.

Expected result:
`recover ranks [...] done.`

Benchmarking and Profiling
Checklist