Add fused_rmsnorm_gated_cpu kernel for CPU to support Qwen3-Next #11577
Conversation
Hi @mingfeima @jianan-gu, please take a look at this PR.
mingfeima left a comment:
Change the function name accordingly to FusedRMSNormGated.
sglang/python/sglang/srt/models/qwen3_next.py, line 285 in 3b1cc46
The corresponding Python frontend change is in https://github.com/sgl-project/sglang/pull/12525/files#diff-aa11f8749a10b44bb815fe4694d98df90bc5ab608fdfec94e99fb247a1c6ed6eR335. @jianan-gu Please update the function name according to this PR.
@yanbing-j put more details in the PR description.
@Alcanderian Could you please help merge this PR? The CI failures are not related to this PR change. Thanks!
Motivation
This PR adds a fused_rmsnorm_gated_cpu kernel to support Qwen3-Next on CPU.
Modifications
Add a native fused_rmsnorm_gated_cpu kernel with AVX512 optimizations.
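For context, a minimal unfused sketch of the gated RMSNorm semantics the kernel is assumed to implement (RMS-normalize the input, apply the learned scale, then multiply by SiLU of the gate) is shown below. This is an assumption for illustration; the exact formulation used by FusedRMSNormGated in qwen3_next.py (e.g. whether the gate is applied before or after normalization) may differ.

```python
import torch
import torch.nn.functional as F

def rmsnorm_gated_ref(x: torch.Tensor, gate: torch.Tensor,
                      weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Unfused reference: RMS-normalize x in fp32, apply the learned scale,
    # then gate with SiLU(gate); cast back to the input dtype at the end.
    orig_dtype = x.dtype
    xf = x.float()
    variance = xf.pow(2).mean(dim=-1, keepdim=True)
    normed = xf * torch.rsqrt(variance + eps)
    out = normed * weight.float() * F.silu(gate.float())
    return out.to(orig_dtype)
```

A fused CPU kernel would produce the same result in a single pass over each row (sum of squares, rsqrt, scale, gate) with AVX512 vector instructions, instead of materializing the intermediate tensors.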
Accuracy Tests
python test/srt/cpu/test_norm.py TestFusedRMSNormGated
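An accuracy check of this kind presumably compares the fused CPU kernel against an unfused reference such as `rmsnorm_gated_ref` above. The sketch below illustrates the pattern; the operator binding name and signature (`torch.ops.sgl_kernel.fused_rmsnorm_gated_cpu`) are assumptions and may not match the actual API, so the kernel call is left commented out.

```python
import torch

# Hypothetical accuracy check; reuses rmsnorm_gated_ref from the sketch above.
x = torch.randn(8, 1024, dtype=torch.bfloat16)
gate = torch.randn_like(x)
weight = torch.randn(1024, dtype=torch.bfloat16)

ref = rmsnorm_gated_ref(x, gate, weight, eps=1e-6)
# out = torch.ops.sgl_kernel.fused_rmsnorm_gated_cpu(x, gate, weight, 1e-6)  # assumed binding
# torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```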
Benchmarking and Profiling
Checklist