Guard fused Marlin MoE atomic add usage to prevent crashes #13262

0xymoro wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes

Hello @0xymoro, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical stability issue within the Marlin-quantized Mixture-of-Experts (MoE) implementation. It introduces a mechanism to guard the usage of an experimental atomic-add reduction path, which previously caused crashes when invoked with large matrix shapes that exceeded safe operational limits. By integrating a proven heuristic, the system now intelligently determines when to enable this reduction path, ensuring robust and safe execution of MoE operations without altering the core CUDA kernels.
Code Review
This pull request effectively addresses a crash in Marlin-quantized MoE layers by introducing a crucial safety check for the atomic-add reduction path. The change correctly replaces a simplistic condition with a more robust heuristic, should_use_atomic_add_reduce, to determine when it's safe to use atomic add. This check is thoughtfully applied to each of the two GEMM stages with their respective tensor shapes. The implementation is clean, correct, and directly solves the illegal memory access issue. The changes are excellent and I have no suggestions for improvement.
Thanks, could you please help fix lint?
Actually investigating something else; this may not be the root cause. Turning this into a draft for now.
This may actually be the fix; still running some tests. It's challenging to verify since it takes high batch sizes over the course of hours to see whether it nondeterministically crashes. Will update/report back if it works.
I got to a point where the Marlin kernel kind of overflowed with large batch sizes. Tracing through it is a slog, but I worked off Hopper and yes, it is reproducible at large batch sizes.

Current best guess and attempted fix:

Root cause: the second Marlin GEMM (`moe_wna16_marlin_gemm`, which multiplies expert outputs back into the model space) was receiving …

Our fix: scale the workspace with the actual amount of routed work, max(2N, K) // 64 * (# routed token blocks), and only fall back to …

Log line from an affected run:
[2025-11-15 02:04:00 TP2] marlin stage2: M=131072 N=7168 K=256 use_atomic=False
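As a minimal sketch of the workspace sizing arithmetic described in that comment (the function name and the block count here are hypothetical; the real allocation lives inside the kernel wrapper):

```python
def marlin_moe_workspace_elems(n: int, k: int, num_routed_token_blocks: int) -> int:
    # max(2N, K) // 64 slots per routed token block, per the comment above.
    return max(2 * n, k) // 64 * num_routed_token_blocks

# Example with the stage-2 shape from the log line (N=7168, K=256)
# and a hypothetical 2048 routed token blocks:
elems = marlin_moe_workspace_elems(n=7168, k=256, num_routed_token_blocks=2048)
# max(14336, 256) // 64 * 2048 = 224 * 2048 = 458,752 workspace slots
```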
Is there any way to reproduce this, or a fix for this issue? We are observing this crash (#13234) on 8xB200 with image 0.5.5.post3 for the Kimi K2 Thinking model.
I don't think this change is the crux of the fix, though (since on some tests 0.5.5.post2 with this change still crashed); it's likely something else. Still investigating.
Motivation
Marlin-quantized MoE layers crash because `sgl_kernel.fused_moe` always forces the experimental atomic-add reduction path. Large (M, N, K) shapes exceed the safe window for the Marlin kernel and trigger illegal memory accesses. We already have a proven heuristic elsewhere in the codebase; this PR reuses it so atomic add is only enabled when the shape/device combination is known to be safe.
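For context, here is a rough illustration of the kind of shape/device check such a heuristic performs. The thresholds and exact signature below are illustrative assumptions, not the actual `should_use_atomic_add_reduce` implementation:

```python
import torch

def should_use_atomic_add_reduce_sketch(m: int, n: int, k: int,
                                        device: torch.device,
                                        dtype: torch.dtype) -> bool:
    # Illustrative stand-in for the real heuristic: atomic-add reduction
    # is only enabled for a narrow, known-safe shape/device window.
    if device.type != "cuda" or dtype not in (torch.float16, torch.bfloat16):
        return False  # atomic fp16/bf16 adds need a CUDA device
    # Assumed thresholds: small output tile, sizeable reduction dimension.
    return m * n <= 64 * 2048 and k >= 2048
```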
Modifications

- Reuse `should_use_atomic_add_reduce` inside `sgl_kernel/python/sgl_kernel/fused_moe.py`.
- Compute the heuristic for each of the two GEMM stages (the first uses `(M, 2N, K)`, the second uses `(M·topk, K, N)`) and pass the result as `use_atomic_add` to each `moe_wna16_marlin_gemm` call, as in the sketch below.
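A minimal sketch of how the guard could be wired in, assuming the heuristic takes `(m, n, k, device, dtype)`; the remaining `moe_wna16_marlin_gemm` arguments are elided because only the `use_atomic_add` flag changes:

```python
# First GEMM stage: shape (M, 2N, K) per the PR description.
use_atomic_add_1 = should_use_atomic_add_reduce(
    m=M, n=2 * N, k=K,
    device=hidden_states.device, dtype=hidden_states.dtype,
)
intermediate = moe_wna16_marlin_gemm(..., use_atomic_add=use_atomic_add_1)

# Second GEMM stage: shape (M * topk, K, N) per the PR description.
use_atomic_add_2 = should_use_atomic_add_reduce(
    m=M * topk, n=K, k=N,
    device=hidden_states.device, dtype=hidden_states.dtype,
)
output = moe_wna16_marlin_gemm(..., use_atomic_add=use_atomic_add_2)
```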