
[6/N] (Elastic EP) Recover failed ranks #15771

Open
UNIDY2002 wants to merge 4 commits into sgl-project:main from HanHan009527:mooncake-pr-recovery

Conversation

@UNIDY2002 (Contributor) commented Dec 24, 2025

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput.

The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag --elastic-ep-rejoin. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it seamlessly rejoins the existing process group. With this design, disruption to ongoing inference is minimized.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] -->E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D
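The healthy-rank side of this flow can be sketched as a per-iteration check. This is a minimal toy model, not the actual SGLang API: the names ElasticEPRecovery, poll_peer_ready, and recover_process_groups are assumptions standing in for the Mooncake-backed helpers.

```python
# Hypothetical sketch of the healthy-rank recovery loop in the flowchart.
# All names here are illustrative stand-ins, not the real SGLang API.

class ElasticEPRecovery:
    def __init__(self, failed_ranks):
        self.pending_ranks = set(failed_ranks)

    def poll_peer_ready(self, rank):
        # The real implementation asks the Mooncake backend whether the
        # relaunched process finished setup; here we assume it has.
        return True

    def recover_process_groups(self, ranks):
        # The real helper rejoins the ranks into every live process group.
        print(f"recover ranks {sorted(ranks)} done")

    def maybe_recover(self):
        """Run once per inference iteration; returns the ranks recovered."""
        if not self.pending_ranks:
            return set()  # fast path: everyone is healthy
        ready = {r for r in self.pending_ranks if self.poll_peer_ready(r)}
        if ready:
            self.recover_process_groups(ready)
            self.pending_ranks -= ready
        return ready

recovery = ElasticEPRecovery(failed_ranks=[9])
recovery.maybe_recover()  # prints: recover ranks [9] done
```

The key property is that inference never blocks on the relaunched process: each iteration only polls and, once the peer is ready, performs the one-time group rejoin.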

Modifications

  • Add a new server flag --elastic-ep-rejoin to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group.
  • Spread the recovered-rank state throughout distributed initialization so the process groups are created in recovery mode when needed.
  • Add elastic EP recovery helpers that poll peer readiness and recover all process groups.
  • Resynchronize expert placement metadata after recovery so recovered ranks observe the same global expert-location state as healthy ranks.
  • Reset elastic EP active-rank state and the EPLB generator after rank recovery so scheduling resumes from a consistent post-rejoin state.
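A minimal sketch of the last bullet, using a plain Python list in place of the real active-rank tensor (variable and function names here are assumptions for illustration):

```python
# Toy model of the post-recovery active-rank reset. The real state is
# tensor-based inside SGLang; a list of bools is purely illustrative.
WORLD_SIZE = 16
active_ranks = [True] * WORLD_SIZE

def mark_failed(rank):
    active_ranks[rank] = False

def reset_after_recovery(recovered_ranks):
    # Once the process groups are rebuilt, flip the recovered ranks back
    # to active so scheduling resumes from a consistent post-rejoin state.
    for rank in recovered_ranks:
        active_ranks[rank] = True

mark_failed(9)
reset_after_recovery([9])
assert all(active_ranks)
```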

Accuracy Tests

Manual multi-node recovery test:

  1. Start serving on 2 nodes with elastic EP enabled and enough redundant experts to tolerate one node failure.

Node rank 0:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 0

Node rank 1:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1

  2. Confirm the service is healthy and serving requests.

  3. Terminate the process on the larger-index node (--node-rank 1).

  4. Relaunch that same node with exactly the same command line, only adding --elastic-ep-rejoin:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 \
  --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> \
  --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 \
  --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce \
  --enable-eplb --ep-num-redundant-experts <num-redundant-experts> \
  --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 \
  --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda \
  --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1 \
  --elastic-ep-rejoin

  5. Verify the relaunched rank rejoins successfully and the cluster returns to the healthy state.

Expected result:

  • The healthy node continues serving while the replacement rank initializes.
  • After the relaunched rank becomes ready, the process groups recover and the node rejoins successfully.
  • Logs on healthy ranks should show recovery completing, e.g. recover ranks [...] done.

Benchmarking and Profiling

Checklist

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 8ce7735 to f620cab on January 25, 2026 14:44
UNIDY2002 force-pushed the mooncake-pr-recovery branch from f620cab to 50203d8 on March 5, 2026 06:37
UNIDY2002 added a commit to HanHan009527/sglang that referenced this pull request Mar 9, 2026
Skip EPLB rebalance after rank recovery and directly sync expert weights
to recovered ranks instead of using EPLB which causes asymmetric P2P
operations due to incorrect old_expert_location_metadata for recovered ranks.

The root cause is that ExpertLocationUpdater relies on both
old_expert_location_metadata and new_expert_location_metadata. For a
recovered rank, the old metadata is stale/incorrect, leading to wrong
P2P operations being calculated (e.g., rank 6 expects to send to rank 7,
but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach
is to skip EPLB after 'EPLB due to rank faults' and let the expert
weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
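The asymmetry described in this commit can be shown with a toy metadata layout. This is a hypothetical simplification of ExpertLocationUpdater's logic, where expert_location[rank] is just a list of expert ids held by that rank:

```python
# Toy illustration of the bug: each rank derives its P2P plan from
# (old, new) expert-location metadata, so a recovered rank with stale
# "old" metadata computes a plan that does not mirror its peer's.
def plan_p2p(old, new, rank):
    sends = [e for e in old[rank] if e not in new[rank]]  # experts to isend
    recvs = [e for e in new[rank] if e not in old[rank]]  # experts to irecv
    return sends, recvs

new = {6: [0], 7: [1, 2]}            # target placement after rebalance
old_seen_by_6 = {6: [0, 1], 7: [2]}  # healthy rank 6: correct history
old_seen_by_7 = {6: [0], 7: [1, 2]}  # recovered rank 7: stale metadata

sends_6, _ = plan_p2p(old_seen_by_6, new, 6)  # rank 6 plans to send expert 1
_, recvs_7 = plan_p2p(old_seen_by_7, new, 7)  # rank 7 plans no matching irecv
assert sends_6 == [1] and recvs_7 == []
```

Because rank 6 posts a send that rank 7 never matches with a receive, the collective hangs; skipping EPLB after recovery sidesteps the stale-metadata input entirely.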
UNIDY2002 force-pushed the mooncake-pr-recovery branch from 50203d8 to 3cd8a7f on March 10, 2026 03:17
UNIDY2002 force-pushed the mooncake-pr-recovery branch from 3cd8a7f to c1e0869 on March 17, 2026 06:18
UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 4ad0286 to 4fdd9fe on March 28, 2026 08:21
UNIDY2002 force-pushed the mooncake-pr-recovery branch 3 times, most recently from 8477c94 to 081c992 on April 7, 2026 05:44
github-actions bot added the documentation label (Improvements or additions to documentation) on Apr 7, 2026
UNIDY2002 marked this pull request as ready for review on April 7, 2026 06:21
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

UNIDY2002 force-pushed the mooncake-pr-recovery branch from 081c992 to 6b2f08e on April 7, 2026 07:49
@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for elastic Expert Parallel (EP) rank recovery and rejoining, specifically for the Mooncake backend. Key changes include the addition of an --elastic-ep-rejoin flag, implementation of rank recovery and rejoining logic in elastic_ep.py, and the integration of recovery checks within the ModelRunner forward pass. The review feedback highlights several critical issues regarding potential deadlocks caused by non-deterministic polling and inconsistent group iteration across ranks. Additionally, there are recommendations to optimize performance in the inference critical path and to ensure that metadata broadcasting safely handles scenarios where the default source rank is the one being recovered.

UNIDY2002 and others added 2 commits April 12, 2026 14:49
- Sort groups by unique_name in _iter_live_parallel_groups to ensure
  consistent iteration order across ranks during collective recovery
- Add early return in maybe_recover_ep_ranks when all ranks are healthy
  to avoid unnecessary CPU tensor copy overhead
- Add comment explaining Mooncake consistency guarantee for recovery
  decision synchronization
- Add comment in broadcast_global_expert_location_metadata explaining
  caller responsibility for ensuring src_rank is healthy
- Mark background thread as daemon in data_parallel_controller.py
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 9108d84 to a16025b Compare April 12, 2026 07:40
@UNIDY2002 (Contributor, Author) left a comment


Response to Review Comments

Thank you for the detailed review. I've addressed the valid concerns and added explanations for the others.

Comment 1 (Critical) - Recovery synchronization

This is not an issue. The Mooncake EP backend's internal semantics guarantee that all ranks observe consistent peer readiness state through its own synchronization mechanisms. The polling appears local but is actually coordinated by the library. I've added a comment in the code explaining this: python/sglang/srt/model_executor/model_runner.py:1445

Comment 2 (High) - Dictionary iteration order

Fixed: _iter_live_parallel_groups now sorts groups by unique_name to ensure consistent iteration order across ranks during collective operations. See python/sglang/srt/elastic_ep/elastic_ep.py:106-111.
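The fix can be sketched as follows; Group and the dict of groups are illustrative stand-ins for the real parallel-group objects in elastic_ep.py:

```python
# Minimal sketch of the deterministic-iteration fix: every rank must walk
# the parallel groups in the same order during collective recovery, or
# ranks can block on different groups and deadlock.
class Group:
    def __init__(self, unique_name):
        self.unique_name = unique_name

# Different processes may have built this dict in different orders.
groups = {"tp": Group("tp:0"), "moe_ep": Group("moe_ep:0"), "dp": Group("dp:0")}

def iter_live_parallel_groups():
    # Sorting by unique_name removes insertion-order nondeterminism.
    return sorted(groups.values(), key=lambda g: g.unique_name)

order = [g.unique_name for g in iter_live_parallel_groups()]
assert order == ["dp:0", "moe_ep:0", "tp:0"]
```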

Comment 3 (High) - CPU copy overhead

Fixed: Added early return when active_ranks.all() and active_ranks_cpu.all() to avoid unnecessary tensor copies in the healthy case. See python/sglang/srt/model_executor/model_runner.py:1435-1436.
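A sketch of the early-return shape, with plain Python lists standing in for the GPU and CPU tensors in model_runner.py (the function name and return value are assumptions):

```python
# Hypothetical sketch of the healthy-path early return: skip all recovery
# bookkeeping, including the device-to-host copy, when every rank is active.
def maybe_recover_ep_ranks(active_ranks, active_ranks_cpu):
    if all(active_ranks) and all(active_ranks_cpu):
        return []  # fast path: nothing to copy, nothing to recover
    # Slow path: only reached when some rank is marked inactive.
    return [i for i, ok in enumerate(active_ranks_cpu) if not ok]

assert maybe_recover_ep_ranks([True, True], [True, True]) == []
assert maybe_recover_ep_ranks([True, False], [True, False]) == [1]
```

Since the check runs on every forward pass, keeping the healthy path free of tensor copies is what makes the recovery hook cheap in steady state.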

Comment 4 (Medium) - src_rank=0 safety

The function signature keeps the default for backward compatibility, but I've added a docstring comment explaining the caller's responsibility to ensure the source rank is healthy. In recovery scenarios, this function is called after all ranks have recovered.

Comment 5 (Medium) - Daemon thread

Fixed: Added daemon=True to the background thread. See python/sglang/srt/managers/data_parallel_controller.py:356.
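For reference, the effect of the daemon flag in a stripped-down form (the poll body is illustrative, not the controller's actual loop):

```python
import threading
import time

# A daemon thread never blocks interpreter shutdown, so a terminating
# rank does not hang joining its background poller.
def poll_loop():
    while True:
        time.sleep(60)  # poll peer status periodically

poller = threading.Thread(target=poll_loop, daemon=True)
poller.start()
assert poller.daemon  # process may exit even while poll_loop still runs
```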

Comment 6 (Medium) - ranks[0] for broadcast

No change needed. At this point in the recovery flow, all ranks (including rank 0) have successfully recovered, so using ranks[0] as the source is correct. Changing this could cause inconsistency with tp_worker's broadcast_pyobj synchronization.

@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request implements rank recovery for elastic Expert Parallelism (EP), enabling relaunched ranks to rejoin active process groups. It introduces the --elastic-ep-rejoin argument and adds logic to ModelRunner to detect inactive ranks, poll for their recovery via the Mooncake backend, and synchronize state across the cluster. Feedback identifies a bug in global rank calculation during seed broadcasting and potential issues with metadata synchronization if rank 0 is recovered. Additionally, improvements are suggested to prevent shared memory leaks when re-initializing message queues and to manage the lifecycle of the ZMQ background thread to avoid port binding conflicts.

- Fix critical: use get_world_group().rank for correct global rank calculation
- Fix high: select healthy rank as broadcast source instead of defaulting to rank 0
- Fix medium: close existing mq_broadcaster before recreating to prevent shared memory leaks
@UNIDY2002 (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request implements rank recovery and rejoining for elastic Expert Parallel (EP) using the Mooncake backend. It introduces the --elastic-ep-rejoin flag and adds logic to synchronize process groups, broadcast expert location metadata, and reset state managers when a rank is relaunched. Key changes include updates to the distributed parallel state to handle recovered ranks, new recovery helpers in elastic_ep.py, and the integration of a health-check and recovery mechanism within the ModelRunner forward pass. Feedback focuses on a performance bottleneck caused by synchronous device-to-host transfers during the per-iteration health check and a potential logic bug where active rank tensors are not reset after a successful recovery, which could lead to infinite recovery loops.


Labels

documentation Improvements or additions to documentation
