Background
We aim to optimize the code structure of the MoE modules in SGLang to enhance extensibility. Currently, there are three main MoE modules: `FusedMoE`, `EPMoE`, and `DeepEPMoE` (with recent additions such as `FlashInferEPMoE` and `FlashInferFusedMoE` at the time of writing). Their implementations suffer from several issues:
- Inconsistent logic flow. Computation logic varies across modules. For instance, `FusedMoE` computes `select_experts` within its forward function, while `DeepEPMoE` handles it externally. Similarly, some forward functions manage `routed_scaling_factor` internally, but others do not.
- Poor extensibility. We plan to support multiple all-to-all communication backends under EP (e.g., DeepEP, PPLX) and grouped-GEMM backends (e.g., Triton, DeepGEMM, Triton Kernels, FlashInfer MoE). The current design requires a dedicated forward function for each backend combination, leading to redundancy.
- Lengthy and duplicated code. Common variable combinations are repeated across functions. For example, over 10 MoE quantization methods each handle about 15 nearly identical inputs in their `apply` functions, and the DeepEP dispatch outputs (8 in total) are duplicated in multiple model files.
Design
To streamline the code structure, we will deprecate all MoE modules except `FusedMoE` and gradually merge existing functionality into it. Below is an overview of the target code structure:
[input_hidden_states]
|
v
TopK.forward -> `select_experts` / `triton_kernels.routing` / bypass
|
v
[TopKOutput]
|
v
FusedMoE.forward -> Dispatcher.dispatch -> DeepEP / PPLX / bypass
| |
| v
| [DispatchOutput]
| |
| v
| quant_method.apply -> MoeRunner.forward -
| | |
| | v
| | pre-permute + grouped_gemm + post-permute
| | |
| |--------------------------------
| v
| [CombineInput]
| |
| v
| Dispatcher.combine -> DeepEP / PPLX / bypass
| |
|---------------------
v
[final_hidden_states]
In addition to existing arguments like `--quantization`, we will introduce `--moe-a2a-backend` and `--moe-runner-backend` to allow users to select the optimal dispatching and grouped-GEMM backends for their use cases.
If a developer wants to support a new backend, they only need to implement the `Dispatcher` or grouped-GEMM logic and define its input/output formats. A `PermuteMethodPool` will automatically select appropriate pre-permute and post-permute functions for layout conversions (if required). Developers can also register new permute functions for unsupported layouts. The TopK forward method will be determined automatically based on the backend arguments.
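As a rough illustration of how these pieces are intended to fit together, the sketch below mocks up the components named above (`TopKOutput`, `DispatchOutput`, `CombineInput`, `Dispatcher`, `PermuteMethodPool`). All fields, signatures, and registered format names are illustrative assumptions, not the actual SGLang API:

```python
# Minimal sketch of the target structure; field names, signatures, and
# format tags are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Protocol

import torch


@dataclass
class TopKOutput:
    topk_ids: torch.Tensor       # [num_tokens, top_k] selected expert ids
    topk_weights: torch.Tensor   # [num_tokens, top_k] routing weights


@dataclass
class DispatchOutput:
    hidden_states: torch.Tensor  # tokens after the (optional) all-to-all dispatch
    topk_output: TopKOutput
    format: str                  # layout tag, e.g. "standard" or "deepep_ll"


@dataclass
class CombineInput:
    hidden_states: torch.Tensor  # expert outputs to be combined / returned
    format: str


class Dispatcher(Protocol):
    """A2A backend chosen via --moe-a2a-backend (DeepEP, PPLX, or bypass)."""

    def dispatch(self, hidden_states: torch.Tensor, topk_output: TopKOutput) -> DispatchOutput: ...

    def combine(self, combine_input: CombineInput) -> torch.Tensor: ...


class PermuteMethodPool:
    """Registry of layout-conversion functions keyed by (src, dst) format."""

    _pre_permutes: dict[tuple[str, str], Callable] = {}

    @classmethod
    def register_pre_permute(cls, src_format: str, dst_format: str):
        def decorator(fn: Callable) -> Callable:
            cls._pre_permutes[(src_format, dst_format)] = fn
            return fn
        return decorator

    @classmethod
    def get_pre_permute(cls, src_format: str, dst_format: str) -> Callable:
        return cls._pre_permutes[(src_format, dst_format)]


@PermuteMethodPool.register_pre_permute("standard", "triton")
def standard_to_triton(dispatch_output: DispatchOutput) -> DispatchOutput:
    # In this toy case the layouts already match; a real pre-permute would
    # reorder tokens by expert before the grouped GEMM.
    return dispatch_output
```

Under this kind of structure, supporting a new grouped-GEMM backend means implementing its runner and registering any missing permute functions; `FusedMoE.forward` itself stays untouched.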
Tasks
The refactoring process is divided into three stages around `MoeRunner.forward`: preparation, implementation, and adoption.
Stage 1: Preparation
This stage focuses on unifying computation structures across all MoE modules and their forward functions, while wrapping dependent variables for better organization.
- Structure modification
  - Move all `select_experts` computations outside MoE modules. [1/N] MoE Refactor: refactor `select_experts` #7966
  - Move all all-to-all communication (for dispatch and combine) inside MoE modules. [3/N] MoE Refactor: Simplify DeepEP Output #8421
  - Move all `routed_scaling_factor` multiplications inside MoE modules.
  - Unify weight loading and quantization methods across all MoE modules. [2/N] MoE Refactor: Unify weight loader and quant methods #8397
  - Unify Triton kernels for `FusedMoE` and `EPMoE`. [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE #8515
- Variable wrap-up (see the sketch after this list)
  - TopK config (e.g., `use_grouped_topk`, `renormalize`) and TopK output. [1/N] MoE Refactor: refactor `select_experts` #7966
  - Dispatch output. [3/N] MoE Refactor: Simplify DeepEP Output #8421
- Server args update
  - Support `--moe-a2a-backend`. [5/N] MoE Refactor: Update MoE parallelism arguments #8658
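As referenced above, here is a minimal sketch of what the TopK variable wrap-up could look like. Only `use_grouped_topk` and `renormalize` come from the task list; the remaining fields, the class names, and the simplified routing logic are assumptions for illustration:

```python
# Illustrative sketch of bundling TopK arguments; only use_grouped_topk and
# renormalize are taken from the task list, everything else is assumed.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class TopKConfig:
    top_k: int
    use_grouped_topk: bool = False
    renormalize: bool = True
    num_expert_group: Optional[int] = None
    topk_group: Optional[int] = None


@dataclass
class TopKOutput:
    topk_ids: torch.Tensor      # [num_tokens, top_k]
    topk_weights: torch.Tensor  # [num_tokens, top_k]


def select_experts(router_logits: torch.Tensor, config: TopKConfig) -> TopKOutput:
    # Plain softmax top-k routing as a stand-in for the real implementation
    # (grouped top-k, correction biases, etc. are omitted).
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, config.top_k, dim=-1)
    if config.renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return TopKOutput(topk_ids=topk_ids, topk_weights=topk_weights)
```

Bundling these arguments lets `select_experts` and every quantization `apply` accept one object instead of a long list of near-identical parameters.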
Stage 2: Implementation
In this stage, we will implement the `MoeRunner` framework; a sketch of the intended runner registration follows the task list below.
- Implement the framework. [7/N] MoE Refactor: the implementation of new framework #9269
- Variable wrap-up
  - MoE model config (e.g., `activation`, `no_combine`). [6/N] MoE Refactor: Cleanup MoE-related configs #8849
  - Quantization utils (e.g., `input_scale`). [7/N] MoE Refactor: the implementation of new framework #9269
  - Combine input. [7/N] MoE Refactor: the implementation of new framework #9269
- Update server args
  - Support `--moe-runner-backend`.
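As mentioned above, the sketch below illustrates one plausible shape for the `MoeRunner` framework: grouped-GEMM backends register themselves under a name that `--moe-runner-backend` selects. The registry, the backend names, and the toy runner are assumptions, not the real implementation:

```python
# Hedged sketch of a MoeRunner with pluggable grouped-GEMM backends.
# The registry, backend names, and the toy torch_native runner are
# illustrative assumptions, not the final SGLang interfaces.
from typing import Callable, Dict

import torch
import torch.nn.functional as F

RunnerFn = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]

_RUNNER_REGISTRY: Dict[str, RunnerFn] = {}


def register_runner(backend: str):
    """Register a grouped-GEMM implementation under a backend name."""
    def decorator(fn: RunnerFn) -> RunnerFn:
        _RUNNER_REGISTRY[backend] = fn
        return fn
    return decorator


@register_runner("torch_native")
def torch_native_runner(hidden_states, w13, w2):
    # Toy stand-in for a fused grouped GEMM: runs every token through every
    # expert and averages, just to keep the example self-contained.
    out = torch.zeros_like(hidden_states)
    for e in range(w13.shape[0]):
        out += F.silu(hidden_states @ w13[e].T) @ w2[e].T
    return out / w13.shape[0]


class MoeRunner:
    def __init__(self, backend: str):
        # `backend` would come from --moe-runner-backend.
        self.run = _RUNNER_REGISTRY[backend]

    def forward(self, hidden_states, w13, w2):
        # In the full design, pre-permute and post-permute (from the
        # PermuteMethodPool) would wrap this call for layout conversion.
        return self.run(hidden_states, w13, w2)
```

A real runner would consume the wrapped dispatch output, weights, and quantization utils, with the pre-/post-permute steps bracketing the call as shown in the design diagram.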
Stage 3: Adoption
The third stage gradually adopts the new framework and replaces existing implementations with the unified structure. This incremental approach allows new grouped-GEMM backends to be merged during refactoring, as long as they are functional and non-invasive.
For MoE backends implemented in quantization files, we need to check the `apply` method (or `apply_with_router_logits` / `apply_without_routing_weights`) and distribute the implementation to the corresponding MoE backend files. The tentative plan for reorganizing the current implementation is listed below, followed by a sketch of the resulting shape of a quant method's `apply`.
- `awq.py`
  - `marlin.py` Refactor Marlin MoeRunner #14554
- `blockwise_int8.py`
- `fp8.py`
  - `intel_amx.py`
  - `aiter.py`
  - `cutlass.py` Refactor Cutlass MoE runner integration #12023
  - `triton.py` [7/N] MoE Refactor: the implementation of new framework #9269
  - `flashinfer_trtllm.py` MoE Refactor: Refactor `fp8.py` -> `flashinfer_trllm.py` #15151
- `gptq.py`
  - `marlin.py` Refactor Marlin MoeRunner #14554
- `modelopt_quant.py`
  - `triton.py` [7/N] MoE Refactor: the implementation of new framework #9269
  - `flashinfer_trtllm.py` MoE Refactor: Refactor `modelopt_quant.py` -> `flashinfer_trllm.py` #16685
  - `flashinfer_cutlass.py`
  - `cutlass.py` Refactor Cutlass MoE runner integration #12023
  - `flashinfer_cutedsl.py`
- `moe_wna16.py`
- `mxfp4.py`
  - `flashinfer_trtllm.py`
  - `triton_kernels.py` Refactor Triton-kernel MoE runner integration #11795
  - `triton.py` [7/N] MoE Refactor: the implementation of new framework #9269
  - `aiter.py`
- `unquant.py`
  - `triton_kernels.py` Refactor Triton-kernel MoE runner integration #11795
  - `aiter.py`
  - `triton.py` [7/N] MoE Refactor: the implementation of new framework #9269
  - `intel_amx.py`
  - `torch_native.py`
  - `npu.py`
- `w4afp8.py`
- `w8a8_fp8.py`
- `w8a8_int8.py`
  - `intel_amx.py`
  - `triton.py` [7/N] MoE Refactor: the implementation of new framework #9269
  - `npu.py`
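To illustrate what distributing a quantization file's `apply` into a backend file could look like, here is a hedged sketch; the class, its fields, and the runner interface are hypothetical stand-ins, not current SGLang code:

```python
# Hedged sketch of a quantization method after the refactor: apply() only
# forwards quantized weights and the wrapped dispatch output to a runner
# backend. All names and fields here are hypothetical.
from dataclasses import dataclass

import torch


@dataclass
class DispatchOutput:
    hidden_states: torch.Tensor
    topk_ids: torch.Tensor
    topk_weights: torch.Tensor


class Fp8MoEMethod:
    """Would live in fp8.py and own only quantized weights and scales."""

    def __init__(self, w13, w2, w13_scale, w2_scale, runner):
        self.w13, self.w2 = w13, w2
        self.w13_scale, self.w2_scale = w13_scale, w2_scale
        self.runner = runner  # e.g. a triton.py / cutlass.py runner instance

    def apply(self, dispatch_output: DispatchOutput) -> torch.Tensor:
        # No permutation or grouped-GEMM logic here anymore; the chosen
        # runner backend handles layout conversion and the kernels.
        return self.runner.run(
            dispatch_output,
            w13=self.w13,
            w2=self.w2,
            w13_scale=self.w13_scale,
            w2_scale=self.w2_scale,
        )
```

The kernel-specific permutation and grouped-GEMM logic would then live entirely in the runner file (e.g., `triton.py` or `cutlass.py`), so each quantization method only describes its weights and scales.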
Some MoE backends are implemented as separate NN modules; their implementations should be distributed into the corresponding MoE backend and quantization files.
- `FlashInferFusedMoE.forward` -> `flashinfer_trtllm.py` + `fp8.py`
- `FlashInferFP4MoE.forward` -> `flashinfer_trtllm.py` + `modelopt_quant.py`
- `EPMoE.forward_deepgemm` -> `deep_gemm.py` + `fp8.py` [8/N] MoE Refactor: deprecate `EPMoE` #11211
- `DeepEPMoE.forward_*`
  - `deep_gemm.py` + `fp8.py` [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE #12054
  - `aiter.py` + `fp8.py`
  - `flashinfer_cutedsl.py` + `modelopt_quant.py`
  - `npu.py` + `fp8.py`