
[6/N] (Elastic EP) Recover failed ranks #15771

Open
UNIDY2002 wants to merge 4 commits into sgl-project:main from HanHan009527:mooncake-pr-recovery

Conversation

@UNIDY2002 (Contributor) commented Dec 24, 2025

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput.

The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag --elastic-ep-rejoin. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it seamlessly rejoins the existing process group. With this design, disruption to ongoing inference is minimized.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] -->E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D
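The healthy-rank side of this flow can be sketched as a per-iteration check. This is a minimal toy model, not the actual SGLang API: the names ElasticEPRecovery, poll_peer_ready, and recover_process_groups are assumptions standing in for the Mooncake-backed helpers.

```python
# Hypothetical sketch of the healthy-rank recovery loop in the flowchart.
# All names here are illustrative stand-ins, not the real SGLang API.

class ElasticEPRecovery:
    def __init__(self, failed_ranks):
        self.pending_ranks = set(failed_ranks)

    def poll_peer_ready(self, rank):
        # The real implementation asks the Mooncake backend whether the
        # relaunched process finished setup; here we assume it has.
        return True

    def recover_process_groups(self, ranks):
        # The real helper rejoins the ranks into every live process group.
        print(f"recover ranks {sorted(ranks)} done")

    def maybe_recover(self):
        """Run once per inference iteration; returns the ranks recovered."""
        if not self.pending_ranks:
            return set()  # fast path: everyone is healthy
        ready = {r for r in self.pending_ranks if self.poll_peer_ready(r)}
        if ready:
            self.recover_process_groups(ready)
            self.pending_ranks -= ready
        return ready

recovery = ElasticEPRecovery(failed_ranks=[9])
recovery.maybe_recover()  # prints: recover ranks [9] done
```

The key property is that inference never blocks on the relaunched process: each iteration only polls and, once the peer is ready, performs the one-time group rejoin.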

Modifications

  • Add a new server flag --elastic-ep-rejoin to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group.
  • Spread the recovered-rank state throughout distributed initialization so the process groups are created in recovery mode when needed.
  • Add elastic EP recovery helpers that poll peer readiness and recover all process groups.
  • Resynchronize expert placement metadata after recovery so recovered ranks observe the same global expert-location state as healthy ranks.
  • Reset elastic EP active-rank state and the EPLB generator after rank recovery so scheduling resumes from a consistent post-rejoin state.
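A minimal sketch of the last bullet, using a plain Python list in place of the real active-rank tensor (variable and function names here are assumptions for illustration):

```python
# Toy model of the post-recovery active-rank reset. The real state is
# tensor-based inside SGLang; a list of bools is purely illustrative.
WORLD_SIZE = 16
active_ranks = [True] * WORLD_SIZE

def mark_failed(rank):
    active_ranks[rank] = False

def reset_after_recovery(recovered_ranks):
    # Once the process groups are rebuilt, flip the recovered ranks back
    # to active so scheduling resumes from a consistent post-rejoin state.
    for rank in recovered_ranks:
        active_ranks[rank] = True

mark_failed(9)
reset_after_recovery([9])
assert all(active_ranks)
```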

Accuracy Tests

Manual multi-node recovery test:

  1. Start serving on 2 nodes with elastic EP enabled and enough redundant experts to tolerate one node failure.

Node rank 0:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 0

Node rank 1:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1

  2. Confirm the service is healthy and serving requests.

  3. Terminate the process on the larger-index node (--node-rank 1).

  4. Relaunch that same node with exactly the same command line, only adding --elastic-ep-rejoin:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1 \
  --elastic-ep-rejoin

  5. Verify the relaunched rank rejoins successfully and the cluster returns to the healthy state.

Expected result:

  • The healthy node continues serving while the replacement rank initializes.
  • After the relaunched rank becomes ready, the process groups recover and the node rejoins successfully.
  • Logs on healthy ranks should show recovery completing, e.g. recover ranks [...] done.

Benchmarking and Profiling

Checklist

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 8ce7735 to f620cab on January 25, 2026 14:44
UNIDY2002 force-pushed the mooncake-pr-recovery branch from f620cab to 50203d8 on March 5, 2026 06:37
UNIDY2002 added a commit to HanHan009527/sglang that referenced this pull request Mar 9, 2026
Skip EPLB rebalance after rank recovery and directly sync expert weights
to recovered ranks instead of using EPLB which causes asymmetric P2P
operations due to incorrect old_expert_location_metadata for recovered ranks.

The root cause is that ExpertLocationUpdater relies on both
old_expert_location_metadata and new_expert_location_metadata. For a
recovered rank, the old metadata is stale/incorrect, leading to wrong
P2P operations being calculated (e.g., rank 6 expects to send to rank 7,
but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach
is to skip EPLB after 'EPLB due to rank faults' and let the expert
weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
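The asymmetry described in this commit can be shown with a toy metadata layout. This is a hypothetical simplification of ExpertLocationUpdater's logic, where expert_location[rank] is just a list of expert ids held by that rank:

```python
# Toy illustration of the bug: each rank derives its P2P plan from
# (old, new) expert-location metadata, so a recovered rank with stale
# "old" metadata computes a plan that does not mirror its peer's.
def plan_p2p(old, new, rank):
    sends = [e for e in old[rank] if e not in new[rank]]  # experts to isend
    recvs = [e for e in new[rank] if e not in old[rank]]  # experts to irecv
    return sends, recvs

new = {6: [0], 7: [1, 2]}            # target placement after rebalance
old_seen_by_6 = {6: [0, 1], 7: [2]}  # healthy rank 6: correct history
old_seen_by_7 = {6: [0], 7: [1, 2]}  # recovered rank 7: stale metadata

sends_6, _ = plan_p2p(old_seen_by_6, new, 6)  # rank 6 plans to send expert 1
_, recvs_7 = plan_p2p(old_seen_by_7, new, 7)  # rank 7 plans no matching irecv
assert sends_6 == [1] and recvs_7 == []
```

Because rank 6 posts a send that rank 7 never matches with a receive, the collective hangs; skipping EPLB after recovery sidesteps the stale-metadata input entirely.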
UNIDY2002 force-pushed the mooncake-pr-recovery branch from 50203d8 to 3cd8a7f on March 10, 2026 03:17
UNIDY2002 force-pushed the mooncake-pr-recovery branch from 3cd8a7f to c1e0869 on March 17, 2026 06:18
UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 4ad0286 to 4fdd9fe on March 28, 2026 08:21
UNIDY2002 force-pushed the mooncake-pr-recovery branch 3 times, most recently from 8477c94 to 081c992 on April 7, 2026 05:44
github-actions bot added the documentation label (Improvements or additions to documentation) on Apr 7, 2026
UNIDY2002 marked this pull request as ready for review on April 7, 2026 06:21
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

UNIDY2002 force-pushed the mooncake-pr-recovery branch from 081c992 to 6b2f08e on April 7, 2026 07:49
@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for elastic Expert Parallel (EP) rank recovery and rejoining, specifically for the Mooncake backend. Key changes include the addition of an --elastic-ep-rejoin flag, implementation of rank recovery and rejoining logic in elastic_ep.py, and the integration of recovery checks within the ModelRunner forward pass. The review feedback highlights several critical issues regarding potential deadlocks caused by non-deterministic polling and inconsistent group iteration across ranks. Additionally, there are recommendations to optimize performance in the inference critical path and to ensure that metadata broadcasting safely handles scenarios where the default source rank is the one being recovered.

UNIDY2002 and others added 2 commits April 12, 2026 14:49
- Sort groups by unique_name in _iter_live_parallel_groups to ensure
  consistent iteration order across ranks during collective recovery
- Add early return in maybe_recover_ep_ranks when all ranks are healthy
  to avoid unnecessary CPU tensor copy overhead
- Add comment explaining Mooncake consistency guarantee for recovery
  decision synchronization
- Add comment in broadcast_global_expert_location_metadata explaining
  caller responsibility for ensuring src_rank is healthy
- Mark background thread as daemon in data_parallel_controller.py
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 9108d84 to a16025b Compare April 12, 2026 07:40
@UNIDY2002 (Contributor, Author) left a comment


Response to Review Comments

Thank you for the detailed review. I've addressed the valid concerns and added explanations for the others.

Comment 1 (Critical) - Recovery synchronization

This is not an issue. The Mooncake EP backend's internal semantics guarantee that all ranks observe consistent peer readiness state through its own synchronization mechanisms. The polling appears local but is actually coordinated by the library. I've added a comment in the code explaining this: python/sglang/srt/model_executor/model_runner.py:1445

Comment 2 (High) - Dictionary iteration order

Fixed: _iter_live_parallel_groups now sorts groups by unique_name to ensure consistent iteration order across ranks during collective operations. See python/sglang/srt/elastic_ep/elastic_ep.py:106-111.
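The fix can be sketched as follows; Group and the dict of groups are illustrative stand-ins for the real parallel-group objects in elastic_ep.py:

```python
# Minimal sketch of the deterministic-iteration fix: every rank must walk
# the parallel groups in the same order during collective recovery, or
# ranks can block on different groups and deadlock.
class Group:
    def __init__(self, unique_name):
        self.unique_name = unique_name

# Different processes may have built this dict in different orders.
groups = {"tp": Group("tp:0"), "moe_ep": Group("moe_ep:0"), "dp": Group("dp:0")}

def iter_live_parallel_groups():
    # Sorting by unique_name removes insertion-order nondeterminism.
    return sorted(groups.values(), key=lambda g: g.unique_name)

order = [g.unique_name for g in iter_live_parallel_groups()]
assert order == ["dp:0", "moe_ep:0", "tp:0"]
```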

Comment 3 (High) - CPU copy overhead

Fixed: Added early return when active_ranks.all() and active_ranks_cpu.all() to avoid unnecessary tensor copies in the healthy case. See python/sglang/srt/model_executor/model_runner.py:1435-1436.
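A sketch of the early-return shape, with plain Python lists standing in for the GPU and CPU tensors in model_runner.py (the function name and return value are assumptions):

```python
# Hypothetical sketch of the healthy-path early return: skip all recovery
# bookkeeping, including the device-to-host copy, when every rank is active.
def maybe_recover_ep_ranks(active_ranks, active_ranks_cpu):
    if all(active_ranks) and all(active_ranks_cpu):
        return []  # fast path: nothing to copy, nothing to recover
    # Slow path: only reached when some rank is marked inactive.
    return [i for i, ok in enumerate(active_ranks_cpu) if not ok]

assert maybe_recover_ep_ranks([True, True], [True, True]) == []
assert maybe_recover_ep_ranks([True, False], [True, False]) == [1]
```

Since the check runs on every forward pass, keeping the healthy path free of tensor copies is what makes the recovery hook cheap in steady state.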

Comment 4 (Medium) - src_rank=0 safety

The function signature keeps the default for backward compatibility, but I've added a docstring comment explaining the caller's responsibility to ensure the source rank is healthy. In recovery scenarios, this function is called after all ranks have recovered.

Comment 5 (Medium) - Daemon thread

Fixed: Added daemon=True to the background thread. See python/sglang/srt/managers/data_parallel_controller.py:356.
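For reference, the effect of the daemon flag in a stripped-down form (the poll body is illustrative, not the controller's actual loop):

```python
import threading
import time

# A daemon thread never blocks interpreter shutdown, so a terminating
# rank does not hang joining its background poller.
def poll_loop():
    while True:
        time.sleep(60)  # poll peer status periodically

poller = threading.Thread(target=poll_loop, daemon=True)
poller.start()
assert poller.daemon  # process may exit even while poll_loop still runs
```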

Comment 6 (Medium) - ranks[0] for broadcast

No change needed. At this point in the recovery flow, all ranks (including rank 0) have successfully recovered, so using ranks[0] as the source is correct. Changing this could cause inconsistency with tp_worker's broadcast_pyobj synchronization.

@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request implements rank recovery for elastic Expert Parallelism (EP), enabling relaunched ranks to rejoin active process groups. It introduces the --elastic-ep-rejoin argument and adds logic to ModelRunner to detect inactive ranks, poll for their recovery via the Mooncake backend, and synchronize state across the cluster. Feedback identifies a bug in global rank calculation during seed broadcasting and potential issues with metadata synchronization if rank 0 is recovered. Additionally, improvements are suggested to prevent shared memory leaks when re-initializing message queues and to manage the lifecycle of the ZMQ background thread to avoid port binding conflicts.

- Fix critical: use get_world_group().rank for correct global rank calculation
- Fix high: select healthy rank as broadcast source instead of defaulting to rank 0
- Fix medium: close existing mq_broadcaster before recreating to prevent shared memory leaks
@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request implements rank recovery and rejoining for elastic Expert Parallel (EP) using the Mooncake backend. It introduces the --elastic-ep-rejoin flag and adds logic to synchronize process groups, broadcast expert location metadata, and reset state managers when a rank is relaunched. Key changes include updates to the distributed parallel state to handle recovered ranks, new recovery helpers in elastic_ep.py, and the integration of a health-check and recovery mechanism within the ModelRunner forward pass. Feedback focuses on a performance bottleneck caused by synchronous device-to-host transfers during the per-iteration health check and a potential logic bug where active rank tensors are not reset after a successful recovery, which could lead to infinite recovery loops.


Labels

documentation Improvements or additions to documentation
