MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trtllm.py #16685

Merged
ch-wan merged 3 commits into sgl-project:main from bzhng-development:brayden/refactor-modelopt-moe-flashinfer-trtllm on Feb 3, 2026

Conversation

b8zhong (Collaborator) commented on Jan 8, 2026

Motivation

Follow-up on #15151 (comment), and part of #8715.

github-actions bot added the quant (LLM Quantization) label on Jan 8, 2026
b8zhong (Collaborator, Author) commented on Jan 8, 2026

/tag-and-rerun-ci one more time?

github-actions bot added the run-ci label on Jan 8, 2026
b8zhong force-pushed the brayden/refactor-modelopt-moe-flashinfer-trtllm branch from 3b79e64 to 25c7e12 on January 9, 2026
ch-wan self-assigned this on Jan 16, 2026
b8zhong force-pushed the brayden/refactor-modelopt-moe-flashinfer-trtllm branch from 25c7e12 to 8ec5536 on January 18, 2026
Review thread on python/sglang/srt/layers/quantization/modelopt_quant.py:

    routing_method_type=routing_method_type,
)

return fused_experts_none_to_flashinfer_trtllm_fp4(
Collaborator:

Why not use the runner?

b8zhong (Collaborator, Author):

Yeah, I think it would be good. The problem, to my understanding, is that we can only register one fused function for flashinfer_trtllm, while the path might need either trtllm_fp4_block_scale_moe or fused_experts_none_to_flashinfer_trtllm_fp8.

So to keep the code simple I didn't use the runner. Otherwise, something like the diff below would let flashinfer_trtllm register a single fused function that dispatches to the two different paths. I am open to either.

git --no-pager diff
diff --git a/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py b/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
index 74c56761a..c2b037bda 100644
--- a/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
+++ b/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
@@ -207,7 +207,6 @@ class FlashInferTrtllmFp8MoeQuantInfo(MoeQuantInfo):
     use_routing_scales_on_input: bool = False
 
 
-@register_fused_func("none", "flashinfer_trtllm")
 def fused_experts_none_to_flashinfer_trtllm_fp8(
     dispatch_output: StandardDispatchOutput,
     quant_info: FlashInferTrtllmFp8MoeQuantInfo,
@@ -478,3 +477,21 @@ def fused_experts_none_to_flashinfer_trtllm_fp4(
     )[0]
 
     return StandardCombineInput(hidden_states=result)
+
+
+@register_fused_func("none", "flashinfer_trtllm")
+def fused_experts_none_to_flashinfer_trtllm(
+    dispatch_output: StandardDispatchOutput,
+    quant_info: MoeQuantInfo,
+    runner_config: MoeRunnerConfig,
+) -> StandardCombineInput:
+    """Dispatch to FP8 or FP4 FlashInfer TRT-LLM MoE based on quant_info type."""
+    if isinstance(quant_info, FlashInferTrtllmFp4MoeQuantInfo):
+        return fused_experts_none_to_flashinfer_trtllm_fp4(
+            dispatch_output, quant_info, runner_config
+        )
+    return fused_experts_none_to_flashinfer_trtllm_fp8(
+        dispatch_output,
+        cast(FlashInferTrtllmFp8MoeQuantInfo, quant_info),
+        runner_config,
+    )
diff --git a/python/sglang/srt/layers/quantization/modelopt_quant.py b/python/sglang/srt/layers/quantization/modelopt_quant.py
index a71c0bc65..3b13ce408 100755
--- a/python/sglang/srt/layers/quantization/modelopt_quant.py
+++ b/python/sglang/srt/layers/quantization/modelopt_quant.py
@@ -1496,6 +1496,10 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
         self, layer: torch.nn.Module, moe_runner_config: MoeRunnerConfig
     ):
         self.moe_runner_config = moe_runner_config
+        if get_moe_runner_backend().is_flashinfer_trtllm():
+            self.runner = MoeRunner(
+                MoeRunnerBackend.FLASHINFER_TRTLLM, moe_runner_config
+            )
 
     def apply(
         self,
@@ -1514,11 +1518,11 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
         ), f"{activation=} missing from {ACT_STR_TO_TYPE_MAP.keys()=}"
         moe_runner_config = self.moe_runner_config
 
-        # FlashInfer TRTLLM FP4 path - check if layer has shuffled weights
+        # FlashInfer TRTLLM FP4 path - layer has shuffled weights only when
+        # backend is flashinfer_trtllm
         if hasattr(layer, "gemm1_weights_fp4_shuffled"):
             from sglang.srt.layers.moe.moe_runner.flashinfer_trtllm import (
                 FlashInferTrtllmFp4MoeQuantInfo,
-                fused_experts_none_to_flashinfer_trtllm_fp4,
             )
             from sglang.srt.layers.moe.utils import RoutingMethodType
 
@@ -1543,9 +1547,7 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
                 routing_method_type=routing_method_type,
             )
 
-            return fused_experts_none_to_flashinfer_trtllm_fp4(
-                dispatch_output, quant_info, moe_runner_config
-            )
+            return self.runner.run(dispatch_output, quant_info)
 
         if self.enable_flashinfer_cutlass_moe:
             assert (
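
For reference, the dispatcher added in the diff above uses cast, MoeRunner, MoeRunnerBackend, and get_moe_runner_backend without showing their imports. A minimal sketch of the import lines the two files would also need, if they are not already present (module paths are my assumption based on the surrounding code, not verified against the tree):

# Hypothetical import block for the dispatcher sketch above;
# the exact module paths may differ in the actual sglang tree.
from typing import cast  # narrows MoeQuantInfo to the FP8 variant in the fallback branch

from sglang.srt.layers.moe.moe_runner import MoeRunner, MoeRunnerBackend
from sglang.srt.layers.moe.utils import get_moe_runner_backend

Registering this single dispatcher keeps the one-fused-function-per-backend constraint intact while still letting both the FP4 and FP8 paths go through the runner.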

b8zhong force-pushed the brayden/refactor-modelopt-moe-flashinfer-trtllm branch 2 times, most recently from bcf041f to 8a7d539 on January 30, 2026
b8zhong added the format (Auto Format Code) label on Jan 30, 2026
b8zhong force-pushed the brayden/refactor-modelopt-moe-flashinfer-trtllm branch from 8a7d539 to 4c8b913 on January 30, 2026
b8zhong and others added 3 commits on January 30, 2026
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
b8zhong force-pushed the brayden/refactor-modelopt-moe-flashinfer-trtllm branch from 4c8b913 to 91c4216 on January 30, 2026
ch-wan merged commit 78bf13d into sgl-project:main on Feb 3, 2026
275 of 318 checks passed
hhu-scitix pushed a commit to scitix/sglang that referenced this pull request Feb 3, 2026
…gl-project#16685)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
…gl-project#16685)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…gl-project#16685)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
b8zhong mentioned this pull request on Feb 5, 2026 (66 tasks)
b8zhong deleted the brayden/refactor-modelopt-moe-flashinfer-trtllm branch on February 6, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…gl-project#16685)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
hhu-scitix pushed a commit to scitix/sglang that referenced this pull request Feb 16, 2026
…gl-project#16685)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

Labels: format (Auto Format Code), quant (LLM Quantization), run-ci
