MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trtllm.py #16685
Merged
ch-wan merged 3 commits into sgl-project:main on Feb 3, 2026
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator (Author)
/tag-and-rerun-ci one more time?
ch-wan reviewed on Jan 28, 2026
routing_method_type=routing_method_type,
)

return fused_experts_none_to_flashinfer_trtllm_fp4(
Collaborator (Author)
Yeah, I think it would be good. The problem is (to my understanding) that we can only register one fused function for flashinfer_trtllm, while the path taken might be either trtllm_fp4_block_scale_moe or fused_experts_none_to_flashinfer_trtllm_fp8.
Therefore, to simplify the code I didn't; otherwise, the diff below would let flashinfer_trtllm register two different fused backends behind a single entry point. I am open to either.
git --no-pager diff
diff --git a/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py b/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
index 74c56761a..c2b037bda 100644
--- a/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
+++ b/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py
@@ -207,7 +207,6 @@ class FlashInferTrtllmFp8MoeQuantInfo(MoeQuantInfo):
use_routing_scales_on_input: bool = False
-@register_fused_func("none", "flashinfer_trtllm")
def fused_experts_none_to_flashinfer_trtllm_fp8(
dispatch_output: StandardDispatchOutput,
quant_info: FlashInferTrtllmFp8MoeQuantInfo,
@@ -478,3 +477,21 @@ def fused_experts_none_to_flashinfer_trtllm_fp4(
)[0]
return StandardCombineInput(hidden_states=result)
+
+
+@register_fused_func("none", "flashinfer_trtllm")
+def fused_experts_none_to_flashinfer_trtllm(
+ dispatch_output: StandardDispatchOutput,
+ quant_info: MoeQuantInfo,
+ runner_config: MoeRunnerConfig,
+) -> StandardCombineInput:
+ """Dispatch to FP8 or FP4 FlashInfer TRT-LLM MoE based on quant_info type."""
+ if isinstance(quant_info, FlashInferTrtllmFp4MoeQuantInfo):
+ return fused_experts_none_to_flashinfer_trtllm_fp4(
+ dispatch_output, quant_info, runner_config
+ )
+ return fused_experts_none_to_flashinfer_trtllm_fp8(
+ dispatch_output,
+ cast(FlashInferTrtllmFp8MoeQuantInfo, quant_info),
+ runner_config,
+ )
diff --git a/python/sglang/srt/layers/quantization/modelopt_quant.py b/python/sglang/srt/layers/quantization/modelopt_quant.py
index a71c0bc65..3b13ce408 100755
--- a/python/sglang/srt/layers/quantization/modelopt_quant.py
+++ b/python/sglang/srt/layers/quantization/modelopt_quant.py
@@ -1496,6 +1496,10 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
self, layer: torch.nn.Module, moe_runner_config: MoeRunnerConfig
):
self.moe_runner_config = moe_runner_config
+ if get_moe_runner_backend().is_flashinfer_trtllm():
+ self.runner = MoeRunner(
+ MoeRunnerBackend.FLASHINFER_TRTLLM, moe_runner_config
+ )
def apply(
self,
@@ -1514,11 +1518,11 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
), f"{activation=} missing from {ACT_STR_TO_TYPE_MAP.keys()=}"
moe_runner_config = self.moe_runner_config
- # FlashInfer TRTLLM FP4 path - check if layer has shuffled weights
+ # FlashInfer TRTLLM FP4 path - layer has shuffled weights only when
+ # backend is flashinfer_trtllm
if hasattr(layer, "gemm1_weights_fp4_shuffled"):
from sglang.srt.layers.moe.moe_runner.flashinfer_trtllm import (
FlashInferTrtllmFp4MoeQuantInfo,
- fused_experts_none_to_flashinfer_trtllm_fp4,
)
from sglang.srt.layers.moe.utils import RoutingMethodType
@@ -1543,9 +1547,7 @@ class ModelOptNvFp4FusedMoEMethod(FusedMoEMethodBase):
routing_method_type=routing_method_type,
)
- return fused_experts_none_to_flashinfer_trtllm_fp4(
- dispatch_output, quant_info, moe_runner_config
- )
+ return self.runner.run(dispatch_output, quant_info)
if self.enable_flashinfer_cutlass_moe:
assert (
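For context on the constraint discussed above, here is a minimal, self-contained sketch of how a one-function-per-key registry behaves and how a single registered entry point can still multiplex the FP4 and FP8 paths by inspecting quant_info. All names in this sketch (the registry internals, Fp4QuantInfo, Fp8QuantInfo) are hypothetical stand-ins, not the actual SGLang register_fused_func or quant-info classes.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: exactly one fused function per (dispatch, backend) key.
_FUSED_FUNCS: Dict[Tuple[str, str], Callable] = {}


def register_fused_func(dispatch_name: str, backend_name: str):
    """Bind one fused function to a (dispatch, backend) pair; re-registration fails."""

    def decorator(fn: Callable) -> Callable:
        key = (dispatch_name, backend_name)
        if key in _FUSED_FUNCS:
            raise ValueError(f"fused func already registered for {key}")
        _FUSED_FUNCS[key] = fn
        return fn

    return decorator


# Stand-ins for the FP4/FP8 quant-info classes referenced in the diff above.
class Fp4QuantInfo: ...
class Fp8QuantInfo: ...


@register_fused_func("none", "flashinfer_trtllm")
def fused_experts_none_to_flashinfer_trtllm(dispatch_output, quant_info, runner_config):
    # Single registered entry point: branch on the quant_info type,
    # mirroring the dispatcher proposed in the diff.
    if isinstance(quant_info, Fp4QuantInfo):
        return "fp4 path"
    return "fp8 path"


# The runner would look up the fused function by key and call it:
fn = _FUSED_FUNCS[("none", "flashinfer_trtllm")]
print(fn(None, Fp4QuantInfo(), None))  # -> "fp4 path"
```

The trade-off is the same one the comment describes: keeping two separate functions is simpler per path, while the single registered dispatcher keeps the registry's one-function-per-key invariant intact.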
ch-wan approved these changes on Feb 3, 2026
Motivation
Follow-up to #15151 (comment), and part of #8715.