
[Roadmap] MoE Refactor #8715

@ch-wan

Description


Background

We aim to optimize the code structure of the MoE modules in SGLang to improve extensibility. Currently, there are three main MoE modules: FusedMoE, EPMoE, and DeepEPMoE (along with recent additions such as FlashInferEPMoE and FlashInferFusedMoE at the time of writing). Their implementations suffer from several issues:

  • Inconsistent logic flow. Computation logic varies across modules. For instance, FusedMoE computes select_experts within its forward function, while DeepEPMoE handles it externally. Similarly, some forward functions apply routed_scaling_factor internally, while others do not.
  • Poor extensibility. We plan to support multiple all-to-all communication backends under EP (e.g., DeepEP, PPLX) and grouped-GEMM backends (e.g., Triton, DeepGEMM, Triton Kernels, FlashInfer MoE). The current design requires a dedicated forward function for each backend combination, leading to redundancy.
  • Lengthy and duplicated code. Common variable combinations are repeated across functions. For example, over 10 MoE quantization methods each handle about 15 nearly identical inputs in their apply functions, and the eight DeepEP dispatch outputs are duplicated across multiple model files.

Design

To streamline the code structure, we will deprecate all MoE modules except FusedMoE and gradually merge existing functionalities into it. Below is an overview of the target code structure:

[input_hidden_states]
          |
          v
     TopK.forward -> `select_experts` / `triton_kernels.routing` / bypass
          |
          v
     [TopKOutput]
          |
          v
   FusedMoE.forward -> Dispatcher.dispatch -> DeepEP / PPLX / bypass
          |                     |
          |                     v
          |              [DispatchOutput]
          |                     |
          |                     v
          |             quant_method.apply -> MoeRunner.forward -
          |                     |                                |
          |                     |                                v
          |                     |            pre-permute + grouped_gemm + post-permute
          |                     |                                |
          |                     |--------------------------------
          |                     v
          |               [CombineInput]
          |                     |
          |                     v
          |            Dispatcher.combine -> DeepEP / PPLX / bypass
          |                     |
          |---------------------                  
          v
[final_hidden_states]
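As a rough illustration of the flow above, here is a minimal Python sketch. The class names (`TopKOutput`, `Dispatcher`, `CombineInput`) and the bypass no-op path follow the diagram, but every concrete signature here is an assumption for illustration, not the actual SGLang API:

```python
from dataclasses import dataclass
from typing import Callable, List

Tensor = List[float]  # stand-in for a real tensor type

@dataclass
class TopKOutput:
    topk_ids: List[int]
    topk_weights: List[float]

@dataclass
class DispatchOutput:
    hidden_states: Tensor
    topk: TopKOutput

@dataclass
class CombineInput:
    hidden_states: Tensor

class Dispatcher:
    """Bypass dispatcher: a no-op stand-in for DeepEP / PPLX backends."""
    def dispatch(self, hidden_states: Tensor, topk: TopKOutput) -> DispatchOutput:
        return DispatchOutput(hidden_states, topk)

    def combine(self, inp: CombineInput) -> Tensor:
        return inp.hidden_states

class FusedMoE:
    def __init__(self, dispatcher: Dispatcher,
                 runner: Callable[[DispatchOutput], CombineInput]):
        self.dispatcher = dispatcher
        self.runner = runner  # plays the role of quant_method.apply -> MoeRunner.forward

    def forward(self, hidden_states: Tensor, topk_output: TopKOutput) -> Tensor:
        dispatched = self.dispatcher.dispatch(hidden_states, topk_output)
        combined = self.runner(dispatched)
        return self.dispatcher.combine(combined)

# Toy usage: the "expert computation" is just a doubling of each value.
moe = FusedMoE(Dispatcher(), lambda d: CombineInput([h * 2.0 for h in d.hidden_states]))
result = moe.forward([1.0, 2.0], TopKOutput([0], [1.0]))  # -> [2.0, 4.0]
```

The key design point is that `FusedMoE.forward` never needs to know which dispatcher or runner backend is plugged in; swapping backends only changes the objects passed to the constructor.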

In addition to existing arguments like --quantization, we will introduce --moe-a2a-backend and --moe-runner-backend to allow users to select the optimal dispatching and grouped-GEMM backends for their use cases.
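A sketch of how these flags might be wired up with argparse follows. The flag names come from this roadmap; the accepted backend values (e.g., "deepep", "deep_gemm") and defaults are placeholders, not the real option lists:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical choices and defaults; the real values are defined by the server args.
parser.add_argument("--moe-a2a-backend", choices=["none", "deepep"], default="none",
                    help="All-to-all dispatch backend for EP MoE.")
parser.add_argument("--moe-runner-backend", choices=["triton", "deep_gemm"],
                    default="triton",
                    help="Grouped-GEMM backend for the MoE experts.")

args = parser.parse_args(["--moe-a2a-backend", "deepep"])
# args.moe_a2a_backend == "deepep"; args.moe_runner_backend falls back to "triton"
```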

If a developer wants to support a new backend, they only need to implement the Dispatcher or grouped-GEMM logic and define the input/output formats. A PermuteMethodPool will automatically select appropriate pre-permute and post-permute functions for layout conversions (if required). Developers can also register new permute functions for unsupported layouts. The TopK forward method will be automatically determined based on the backend arguments.
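A toy sketch of what such a pool could look like is shown below: a registry keyed on (source layout, destination layout) that returns an identity function when no conversion is needed. The layout names and the registration decorator are illustrative assumptions, not the actual PermuteMethodPool interface:

```python
from typing import Callable, Dict, Tuple

class PermuteMethodPool:
    """Toy registry mapping (src_layout, dst_layout) -> permute function."""
    _pool: Dict[Tuple[str, str], Callable] = {}

    @classmethod
    def register(cls, src: str, dst: str):
        def decorator(fn: Callable) -> Callable:
            cls._pool[(src, dst)] = fn
            return fn
        return decorator

    @classmethod
    def get(cls, src: str, dst: str) -> Callable:
        if src == dst:
            return lambda x: x  # layouts already match: no conversion required
        return cls._pool[(src, dst)]  # raises KeyError for unsupported layouts

# Developers register converters for new layout pairs (names are hypothetical).
@PermuteMethodPool.register("standard", "deepep_ll")
def standard_to_deepep_ll(x):
    return list(reversed(x))  # placeholder for a real layout conversion
```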

Tasks

The refactoring process is divided into three stages around MoeRunner.forward: preparation, implementation, and adoption.

Stage 1: Preparation

This stage focuses on unifying computation structures across all MoE modules and their forward functions, while wrapping dependent variables for better organization.
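As an example of this wrapping, the eight DeepEP dispatch outputs that are currently duplicated across model files could be bundled into a single dataclass so that call sites pass one object instead of loose variables. The field names below are illustrative, not the actual DeepEP interface:

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class DeepEPDispatchOutput:
    """Bundles dispatch outputs into one object (illustrative fields only;
    the real dispatch returns eight values)."""
    hidden_states: Any
    topk_ids: Any
    topk_weights: Any
    num_recv_tokens_per_expert: Optional[List[int]] = None

# Call sites construct and pass a single object instead of loose variables.
out = DeepEPDispatchOutput(hidden_states=[1.0], topk_ids=[0], topk_weights=[1.0])
```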

Stage 2: Implementation

In this stage, we will implement the MoeRunner framework.
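A toy sketch of the pre-permute / grouped-GEMM / post-permute structure that MoeRunner.forward would encapsulate is given below, with a per-expert scalar multiply standing in for the actual grouped-GEMM kernels; the class shape is an assumption for illustration:

```python
from typing import List

class MoeRunner:
    """Toy runner: pre-permute groups tokens by expert, a 'grouped GEMM'
    (here, one scalar multiply per expert) processes each group, and
    post-permute restores results to the original token order."""
    def __init__(self, expert_scales: List[float]):
        self.expert_scales = expert_scales  # stand-in for per-expert weights

    def forward(self, tokens: List[float], expert_ids: List[int]) -> List[float]:
        # Pre-permute: visit tokens grouped by their assigned expert.
        order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
        out = [0.0] * len(tokens)
        for i in order:
            # "Grouped GEMM" stand-in: apply the expert's computation.
            # Post-permute: write the result back to the token's original slot.
            out[i] = tokens[i] * self.expert_scales[expert_ids[i]]
        return out
```

Under this structure, a new grouped-GEMM backend only replaces the computation in the middle; the permute bookkeeping stays shared.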

Stage 3: Adoption

The third stage gradually adopts the new framework and replaces existing implementations with the unified structure. This incremental approach allows new grouped-GEMM backends to be merged during refactoring, as long as they are functional and non-invasive.

For MoE backends implemented in quantization files, we need to audit the apply method (or apply_with_router_logits / apply_without_routing_weights) and move the implementation into the corresponding MoE backend files. The tentative plan for reorganizing the current implementation is as follows.

Some MoE backends are implemented as separate NN modules. Their implementations should be split across the corresponding MoE backend and quantization files.
