EP Support for Piecewise Cuda Graph #14164
Conversation
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Summary of Changes

Hello @Oasis-Git, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the SGLang runtime by integrating Mixture-of-Experts (MoE) layers into the piecewise CUDA graph execution pipeline. The primary goal is to leverage CUDA graphs to improve the performance of Fused MoE operations, with a focus on standard top-k output formats for compatibility. The changes also refactor MoE output handling for better type management and introduce NPU-specific optimizations for the Qwen3MoE model, broadening hardware support and efficiency.
Code Review
This pull request adds support for Expert Parallelism in Piecewise CUDA Graphs, which is a significant feature for optimizing MoE models. The changes are mostly well-structured, touching upon the forward context, MoE layers, and model execution runners. My review highlights a critical bug that could lead to infinite recursion, a TypeError due to incorrect API usage, and several medium-severity maintainability issues related to code duplication and brittle logic. Addressing these points will improve the robustness and long-term health of the codebase.
```python
def forward(self, hidden_states: torch.Tensor, topk_output: TopKOutput):
    if is_in_piecewise_cuda_graph():
        assert TopKOutputChecker.format_is_standard(
            topk_output
        ), "Only standard topk output is supported for piecewise cuda graph"
        return torch.ops.sglang.moe_forward_piecewise_cuda_graph_impl(
            hidden_states,
            topk_output.topk_weights,
            topk_output.topk_ids,
            topk_output.router_logits,
            self.layer_id,
        )
    else:
        return self.forward_impl(hidden_states, topk_output)
```
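The gating shown here follows a simple pattern: a process-wide flag routes execution either to a graph-capturable custom op or to the eager implementation. A toy illustration of that pattern (not the real SGLang internals; `is_in_piecewise_cuda_graph` is stubbed):

```python
# Toy illustration (not the real SGLang internals) of the gating pattern:
# a process-wide flag routes execution either to a graph-capturable custom
# op or to the eager implementation.
_in_piecewise_cuda_graph = False

def is_in_piecewise_cuda_graph() -> bool:
    return _in_piecewise_cuda_graph

def forward(x):
    if is_in_piecewise_cuda_graph():
        # stands in for torch.ops.sglang.moe_forward_piecewise_cuda_graph_impl
        return ("graph_op", x)
    return ("eager", x)

assert forward(1) == ("eager", 1)
_in_piecewise_cuda_graph = True
assert forward(1) == ("graph_op", 1)
```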
This forward method is identical to the one in the base class FusedMoE. You can remove this duplicated implementation from FlashInferFusedMoE and FlashInferFP4MoE to improve maintainability. The logic will be inherited from FusedMoE.
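A minimal sketch of this cleanup, with toy classes standing in for the real ones: because the subclass override is byte-for-byte identical to the base method, deleting it lets Python's normal attribute lookup fall back to `FusedMoE.forward`.

```python
# Toy classes standing in for the real FusedMoE hierarchy; the point is only
# that an identical override can be deleted and inherited instead.
class FusedMoE:
    def forward(self, hidden_states):
        # stands in for the real piecewise-graph-aware forward
        return ("fused", hidden_states)

class FlashInferFusedMoE(FusedMoE):
    # no forward override: inherited from FusedMoE
    pass

class FlashInferFP4MoE(FusedMoE):
    pass

assert FlashInferFusedMoE().forward(42) == ("fused", 42)
# the subclass defines no forward of its own
assert "forward" not in FlashInferFusedMoE.__dict__
```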
```python
def forward_impl(self, hidden_states: torch.Tensor, topk_output: TopKOutput):
    origin_hidden_states_dim = hidden_states.shape[-1]
    assert self.quant_method is not None
    dispatch_output = self.dispatcher.dispatch(
        hidden_states=hidden_states, topk_output=topk_output
    )
    if _use_aiter and self.dispatcher.local_expert_mapping is not None:
        self.expert_mask_gpu = (
            (
                (self.dispatcher.local_expert_mapping >= 0)
                & (self.dispatcher.local_expert_mapping < self.num_local_experts)
            )
            .to(torch.int32)
            .to(device="cuda")
        )
    combine_input = self.run_moe_core(
        dispatch_output=dispatch_output,
    )
```

```python
def forward(self, hidden_states: torch.Tensor, topk_output: TopKOutput):
    if is_in_piecewise_cuda_graph():
        assert (
            topk_output.format() == TopKOutputFormat.STANDARD
        ), "Only standard topk output is supported for piecewise cuda graph"
        return torch.ops.sglang.moe_forward_piecewise_cuda_graph_impl(
            hidden_states,
            topk_output.topk_weights,
            topk_output.topk_ids,
            topk_output.router_logits,
            self.layer_id,
        )
    else:
        return self.forward_impl(hidden_states, topk_output)
```
```python
def forward(self, hidden_states: torch.Tensor, topk_output: TopKOutput):
    if is_in_piecewise_cuda_graph():
        assert (
            topk_output.format() == TopKOutputFormat.STANDARD
        ), "Only standard topk output is supported for piecewise cuda graph"
        return torch.ops.sglang.moe_forward_piecewise_cuda_graph_impl(
            hidden_states,
            topk_output.topk_weights,
            topk_output.topk_ids,
            topk_output.router_logits,
            self.layer_id,
        )
    else:
        return self.forward_impl(hidden_states, topk_output)
```
```python
moe_block = None
if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
    moe_block = layer.mlp.experts
if hasattr(layer, "block_sparse_moe") and hasattr(
    layer.block_sparse_moe, "experts"
):
    moe_block = layer.block_sparse_moe.experts
if hasattr(layer, "moe") and hasattr(layer.moe, "experts"):
    moe_block = layer.moe.experts
self.moe_layers.append(moe_block)
```
This series of if statements to find the moe_block can be brittle. If a layer happens to have more than one of these MoE attributes (e.g., both mlp.experts and block_sparse_moe.experts), the moe_block variable will be overwritten, and only the last one found will be used. Using if/elif/else would make the priority explicit and the code more robust.
Suggested change:

```python
moe_block = None
if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
    moe_block = layer.mlp.experts
elif hasattr(layer, "block_sparse_moe") and hasattr(
    layer.block_sparse_moe, "experts"
):
    moe_block = layer.block_sparse_moe.experts
elif hasattr(layer, "moe") and hasattr(layer.moe, "experts"):
    moe_block = layer.moe.experts
self.moe_layers.append(moe_block)
```
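A small self-contained demonstration of the overwrite hazard: when a layer exposes more than one experts attribute (a hypothetical layer here), the last matching `if` silently wins, while `elif` makes the priority explicit.

```python
# Hypothetical layer that exposes two experts attributes at once, to show
# why the if/if chain is order-dependent while if/elif is explicit.
from types import SimpleNamespace

layer = SimpleNamespace(
    mlp=SimpleNamespace(experts="mlp_experts"),
    moe=SimpleNamespace(experts="moe_experts"),
)

# if/if chain: the later `moe` branch overwrites the `mlp` result.
moe_block = None
if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
    moe_block = layer.mlp.experts
if hasattr(layer, "moe") and hasattr(layer.moe, "experts"):
    moe_block = layer.moe.experts
assert moe_block == "moe_experts"

# if/elif chain: the first match wins.
moe_block = None
if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
    moe_block = layer.mlp.experts
elif hasattr(layer, "moe") and hasattr(layer.moe, "experts"):
    moe_block = layer.moe.experts
assert moe_block == "mlp_experts"
```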
/tag-and-rerun-ci

@Oasis-Git Could you add a unit test?
/tag-and-rerun-ci
Signed-off-by: Ann <yuanlai444@gmail.com>
```python
_forward_context.set_forward_batch(forward_batch)
_forward_context.set_attention_layers(attention_layers)
_forward_context.set_quant_config(quant_config)
_forward_context.set_moe_layers(moe_layers)
```
These setters are redundant. Can you just put them in the __init__ of ForwardContext?
```python
    hidden_states: torch.Tensor,
    topk_output: TopKOutput,
):
    if is_in_piecewise_cuda_graph():
```
Normally, we put the common branch as the first branch of the if/else statement to improve readability. So in this case, it should be:

```python
if not is_in_piecewise_cuda_graph():
    ...
else:
    ...
```

```python
    f"Use Sliding window memory pool. full_layer_tokens={self.full_max_total_num_tokens}, swa_layer_tokens={self.swa_max_total_num_tokens}"
)

def can_run_piecewise_cuda_graph(self):
```
Can we put these checks to server_args.py?
We would like to put the checks as early as possible and make the printed ServerArgs reflect most things.
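An illustrative-only sketch of hoisting such a check into `ServerArgs` so it runs at startup and the printed args reflect the effective values. The `enable_piecewise_cuda_graph` / `enable_ep_moe` fields and the rule relating them are hypothetical, not the real `ServerArgs` schema.

```python
# Hypothetical fields and rule, for illustration only: the real check would
# live in server_args.py and validate the actual piecewise-graph requirements.
from dataclasses import dataclass

@dataclass
class ServerArgs:
    enable_piecewise_cuda_graph: bool = False
    enable_ep_moe: bool = False

    def __post_init__(self):
        # Downgrade unsupported combinations early, before anything is built,
        # so printing the ServerArgs shows the effective configuration.
        if self.enable_piecewise_cuda_graph and not self.enable_ep_moe:
            self.enable_piecewise_cuda_graph = False

args = ServerArgs(enable_piecewise_cuda_graph=True)
assert args.enable_piecewise_cuda_graph is False
```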
```python
with freeze_gc(self.model_runner.server_args.enable_cudagraph_gc):
    with set_compiled(True):
```

We can merge the two `with` statements into a single one.
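A stubbed demonstration that the nested form is equivalent to one `with` statement listing both context managers; `freeze_gc` and `set_compiled` are stand-ins for the real helpers.

```python
# Stubs for illustration: the real freeze_gc/set_compiled manage GC state and
# a compilation flag. The point is only the single-statement `with` form.
from contextlib import contextmanager

order = []

@contextmanager
def freeze_gc(enable_gc):
    order.append("freeze_gc")
    yield

@contextmanager
def set_compiled(flag):
    order.append("set_compiled")
    yield

# Single-statement form; contexts are entered left to right,
# exited right to left.
with freeze_gc(True), set_compiled(True):
    pass

assert order == ["freeze_gc", "set_compiled"]
```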
Motivation
Support Expert Parallelism (EP) for Piecewise CUDA Graph.
Modifications
Following discussions with @ch-wan, @ispobock, and @BBuf, the Piecewise CUDA Graph wrapper for the MoE layer has been scoped to focus on Fused MoE and its derived classes. Currently, we primarily support standard top-k output, handling the necessary packing and unpacking operations for Torch library registration.
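The "packing and unpacking" idea can be sketched as follows: custom-op registries (e.g. `torch.library`) generally accept flat tensor arguments, so a structured top-k output is unpacked into its fields at the op boundary and reassembled inside. The `NamedTuple` below is a stand-in for the real `TopKOutput` class.

```python
# Stand-in for the real TopKOutput; plain lists replace tensors so the
# sketch stays self-contained. Only the pack/unpack shape is the point.
from typing import NamedTuple, Optional

class TopKOutput(NamedTuple):
    topk_weights: list
    topk_ids: list
    router_logits: Optional[list]

def pack(out: TopKOutput):
    # Flatten the structured output into positional op arguments.
    return out.topk_weights, out.topk_ids, out.router_logits

def unpack(topk_weights, topk_ids, router_logits) -> TopKOutput:
    # Rebuild the structured output on the other side of the op boundary.
    return TopKOutput(topk_weights, topk_ids, router_logits)

out = TopKOutput([0.7, 0.3], [2, 5], None)
assert unpack(*pack(out)) == out
```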
Also credit to: @byjiang1996
Accuracy Tests & Benchmark Output
For Deepseek-v2 model:
For gpt-oss 20B model:
For Qwen3-30B-A3B:
Checklist