Feat: Integrate FBGEMM into EPMoE #7123
Conversation
Summary of Changes
Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request is a work-in-progress effort to integrate the FBGEMM library into the Expert Parallel Mixture of Experts (EPMoE) layer. The goal is to leverage FBGEMM for potentially faster grouped matrix multiplication operations, including support for FP8 quantized weights. This adds an alternative backend for the core GEMM operation within the MoE layer.
Highlights
- FBGEMM Integration: Integrated FBGEMM as an alternative grouped GEMM implementation within the EPMoE layer, specifically in the `GroupedGemmRunner`.
- FP8 Support: Added support for FP8 rowwise quantization when using the FBGEMM grouped GEMM path.
- Configuration: Introduced a `use_fbgemm` boolean flag in `GroupedGemmRunner` to enable or disable the FBGEMM implementation.
- Typo Fixes: Corrected a recurring typo (`preproess` to `preprocess`) in the function name `run_moe_ep_preprocess` across multiple files.
Force-pushed 2b8b22b to 9d65496
Code Review
This PR integrates FBGEMM into the EPMoE layer. The changes include adding new import paths, updating the GroupedGemmRunner to use FBGEMM, and enabling this path in the EPMoE layer. Typos related to preprocess and wrapper have also been corrected.
We should assert block_shape is None now.
The e2e test does not pass. Investigating. Expected:
It's really strange, because the benchmark test with verify_data passed.
Just wanted to note again that another way to use the FBGEMM kernels is through its pip package:
Currently, the tricky part is that inside m_sizes some elements can be 0, which makes FBGEMM work incorrectly: the tiling mechanism in FBGEMM is pre-configured, and skipping a tile causes a tiling-iterator mismatch.
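To make the zero-sized-group issue concrete, here is a toy, pure-Python reference of a grouped GEMM (not the FBGEMM kernel; all names are illustrative). A host-side loop can trivially skip an expert with `m_sizes[g] == 0`, but a kernel with a pre-configured tile schedule cannot:

```python
def matmul_nt(a, b_g):
    # a: M rows of length K; b_g: N rows of length K -> a @ b_g^T, shape M x N
    return [[sum(x * y for x, y in zip(row, w)) for w in b_g] for row in a]

def grouped_gemm_ref(a, b, m_sizes):
    """Toy grouped GEMM: the first m_sizes[0] rows of `a` go to expert 0,
    the next m_sizes[1] rows to expert 1, and so on."""
    out, start = [], 0
    for g, m in enumerate(m_sizes):
        if m == 0:       # zero-sized group: trivially skipped on the host,
            continue     # but a pre-configured tile schedule may mis-iterate
        out.extend(matmul_nt(a[start:start + m], b[g]))
        start += m
    return out

m_sizes = [2, 0, 1]                          # expert 1 gets no tokens
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # M=3, K=2
b = [[[1.0, 0.0], [0.0, 1.0]],               # expert 0 (identity)
     [[2.0, 2.0], [2.0, 2.0]],               # expert 1 (unused this step)
     [[1.0, 1.0], [0.0, 0.0]]]               # expert 2
c = grouped_gemm_ref(a, b, m_sizes)
print(c)   # [[1.0, 2.0], [3.0, 4.0], [11.0, 0.0]]
```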
Made some progress. Now with tp=1 and disable-cuda-graph, the result is correct. The prompt: But with tp>1 or enable-cuda-graph, it hangs at the following step:
Force-pushed 4ad338d to 857a55f
The culprit is c.shape. But the underlying reason is in the FBGEMM input preparation, related to K.
For a qwen3-moe model, the output shape of each layer should look like the following: But in my case, a.shape[-1] is 210, which is not correct; the reason is that the previous layer's c.shape is incorrect. Investigating the final root cause.
Force-pushed 857a55f to 160cc5b
Let us know how we can help; happy to add support if needed. I think the 0-sized groups you've described should be supported. We've tested similar workloads and actually have a few cool optimizations for them in the CUTLASS version of the kernel. If they are still causing problems after you debug the shape issue, we can take a look.
Thanks @jwfromm. Per the current investigation, the actual root cause is: the handling of m_sizes and b in prepare_fbgemm_inputs disrupts the expected tensor layout across multiple forward layers, especially in scenarios involving a shared expert pool (such as in Qwen3, where multiple MoE layers share b).

The reason the Triton version does not encounter this issue is that it does not reshape or filter invalid experts. Instead, it uses seg_indptr and weight_indices internally to mask or skip invalid experts. Expert selection is determined dynamically at runtime, so there is no need to reshape b in advance. More specifically, it uses compute_m_range() to map between the token index range and the expert id.

Some more details about my design. The reason for introducing prepare_fbgemm_inputs is that the original FBGEMM input preparation could not pass under "tp>1 and enable cuda-graph". So I tried padding the filtered m_sizes. But I believe the m_sizes calculation breaks the original b, which causes the problem. I think FBGEMM supports m_sizes[i] == 0; otherwise the following change (my first version) would not work with tp=1, disable-cuda-graph:
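For illustration, the Triton path's CSR-style seg_indptr can be converted into the per-expert row counts (m_sizes) that FBGEMM consumes; a minimal sketch (the helper name is hypothetical):

```python
def seg_indptr_to_m_sizes(seg_indptr):
    """seg_indptr is CSR-style: entry g is the first token index of
    expert g's segment, with a trailing total (length G + 1). FBGEMM
    instead wants per-expert row counts m_sizes; note that an expert
    with no tokens yields m_sizes[g] == 0."""
    return [seg_indptr[g + 1] - seg_indptr[g]
            for g in range(len(seg_indptr) - 1)]

seg_indptr = [0, 4, 4, 7, 10]              # 4 experts; expert 1 gets no tokens
print(seg_indptr_to_m_sizes(seg_indptr))   # [4, 0, 3, 3]
```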
The current solution will not work with TP>1. The reason is that the SGLang Triton kernel and the FBGEMM kernel mechanisms are different.
With TP=2 and enable-ep-moe, it is actually EP=2, and the SGLang Triton kernel's shapes look like the following: In the FBGEMM kernel, the mechanism is different. I use the following approach to calculate the m_sizes that FBGEMM requires. But with TP>=2 and weight_column_major=True, the weight tensor needs to be reshaped. Still striving to fix it. Any comment or input is welcome.
Made some progress; now TP2 EP2 fails in empty-tensor handling.
The new prepare function adds expert-index reshuffle logic. I will push the new prepare function once tp2 is fully supported.
Force-pushed 160cc5b to 163a5b2
Progress update: found the root cause of this issue, a tensor's base pointer should be updated in TP2. Now: tp2 disable-cuda-graph no crash, result correct, passed. With the fix, the TP2 result (disable-cuda-graph) is correct now. The remaining issue is:
Force-pushed 5045aa0 to fbe8ace
I added several Triton kernels to fix the CUDA graph capture issues introduced by the "a" tensor shape change.
The latest issue with "a" tensor slicing has been fixed.
@jianyuh Hi, could you please help with this bug? Thanks!
Force-pushed dd17e46 to 732fa10
The CUDA graph capture problem has been fixed without modifying the FBGEMM kernel.
- tp1 disable-cuda-graph: no crash, result correct, passed.
- tp2 disable-cuda-graph: no crash, result correct, passed.
- tp4 disable-cuda-graph: no crash, result correct, passed.
- tp8 disable-cuda-graph: no crash, result correct, passed.
@jianyuh The FBGEMM kernel is stable even when m_sizes.sum() == 0. I modified the host-side logic and introduced some Triton kernels to handle the "TiledCopy" for the C tensor. Now the problem is resolved.
@yuan-luo Thanks for the workaround for the m_sizes.sum() == 0 vs. CUDA graph compatibility issues! Will follow up on the FBGEMM side. cc @levendlee
Benchmarked SGLang Triton and FBGEMM in SGLang E2E. We can see a lot of printing in the FBGEMM benchmark; it seems related to JIT and auto-tune in FBGEMM. It might make TTFT slow. Triton: FBGEMM: @jianyuh @jwfromm Is there any configuration wrong in these steps?
@zhyncs The current ep moe (cuda) supports the deepgemm_contiguous, deepgemm_mask, and flashinfer_cutedsl types. FBGEMM can be another alternative. I'll check whether it is feasible to fit it into the new architecture.

Motivation
Background
Thanks to @BBuf, who introduced Meta's FBGEMM into SGLang in #6924.
Now, this PR integrates FBGEMM BF16 into EPMoE. In follow-up PRs we will introduce FBGEMM FP8 and warp-specialization features into SGLang EPMoE.
Modifications
This PR handles several issues:
SGLang tensor shapes:
- a.shape (activation): [M, K]
- b.shape (weight): [G, N, K]
- c.shape (output of each layer): [M, N]
FBGEMM tensor shapes:
- a.shape (activation): [M, K]
- b.shape (weight): [G/tp_size, N, K]
- c.shape (output of each layer): [M, N]
Triton GEMM leverages seg_indptr and weight_indices to map each tile-id to its expert-id and token index range.
FBGEMM has no mapping of weight_indices and seg_indptr; its way of deciding the expert-id is m_sizes. FBGEMM flattens the b tensor into a [G * N, K] matrix, which requires the per-expert token groups to be aligned with the flattened weight tensor. Moreover, TP>1 makes things worse.
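As a sketch of the shape bookkeeping (helper names are hypothetical, assuming experts are sharded contiguously across ranks): with b flattened to [G * N, K], expert g's weights occupy rows g*N to (g+1)*N, and under TP/EP sharding a global expert id must be remapped to a rank-local one:

```python
def expert_row_range(g, n):
    """Row range of expert g's [N, K] block inside the flattened
    [G * N, K] weight matrix that FBGEMM consumes."""
    return g * n, (g + 1) * n

def local_expert_id(global_id, rank, experts_per_rank):
    """Map a global expert id to this rank's local id, or None if the
    expert's weights live on another rank (contiguous sharding assumed)."""
    lo = rank * experts_per_rank
    return global_id - lo if lo <= global_id < lo + experts_per_rank else None

# Example: G=8 experts, tp_size=2 -> 4 experts per rank, N=128
print(expert_row_range(2, 128))   # (256, 384)
print(local_expert_id(5, 1, 4))   # 1
print(local_expert_id(5, 0, 4))   # None
```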
Because the A tensor's base pointer needs to be adjusted when TP>1, the A tensor has to be sliced to handle the partition-wise GEMM calculation, which breaks CUDA graph capture. We introduce a new approach: a newly-added Triton kernel adjusts the valid tile-ids, and the reduce-sum phase then copies only the valid portions back into the C tensor. This keeps the A tensor's shape unchanged across TP partitioning and resolves the CUDA graph capture error.
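A minimal sketch of the copy-back idea (plain Python standing in for the Triton kernel; names are illustrative): the GEMM runs over fixed-shape, possibly padded inputs so CUDA graph capture always sees stable shapes, and only the rows flagged valid are copied back into C:

```python
def copy_back_valid(c_full, valid_mask, c_out):
    """Copy only the valid rows of the fixed-shape GEMM output c_full
    into the user-visible C. Padded / invalid rows are ignored, so the
    input shapes never change with the TP partitioning."""
    j = 0
    for row, ok in zip(c_full, valid_mask):
        if ok:
            c_out[j] = row
            j += 1
    return c_out

c_full = [[1, 1], [9, 9], [2, 2]]   # middle row came from a padded tile
valid_mask = [True, False, True]
c_out = [[0, 0], [0, 0]]
print(copy_back_valid(c_full, valid_mask, c_out))   # [[1, 1], [2, 2]]
```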
Checklist