[4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE #8515
Conversation
This commit is causing 0 accuracy with the ModelOpt FP4 MoE. I will try to debug it.
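As an illustrative aside (not part of the report above): a crude way to quantify an accuracy collapse like this is to score a handful of known-answer prompts against the running server. The `/generate` payload below follows SGLang's native HTTP API; the port, prompts, and expected answers are placeholder assumptions.

```python
# Hypothetical smoke test: send a few known-answer prompts to a running
# SGLang server and report a crude accuracy score. A result near 0 here
# mirrors the "0 accuracy" symptom reported above.
import requests

BASE_URL = "http://127.0.0.1:30000"  # assumed default SGLang port

CASES = [
    ("What is 7 * 8? Answer with the number only.", "56"),
    ("What is the capital of France? Answer with one word.", "Paris"),
    ("Spell the word 'cat' in uppercase.", "CAT"),
]


def ask(prompt: str) -> str:
    # SGLang's native generate endpoint takes "text" plus "sampling_params".
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={
            "text": prompt,
            "sampling_params": {"temperature": 0, "max_new_tokens": 16},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]


correct = sum(expected in ask(prompt) for prompt, expected in CASES)
print(f"accuracy: {correct}/{len(CASES)}")
```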
Hi @trevor-m, sorry for this, and thanks for your help!
Do we have 5/N?
@trevor-m Sorry for the inconvenience. I tried to test DeepSeek-FP4 before merging this PR but ran into issues loading the model. Here is my command:
python3 -m sglang.launch_server --model-path Barrrrry/DeepSeek-R1-W4AFP8 --context-length 8192 --tp 2 --ep 2 --trust-remote-code --enable-ep-moe
Could you share how to run this model so that I can avoid a similar issue in the future?
No worries! Here is a command:
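The exact command from that reply is not reproduced here. For reference, a readiness check after launching a server along the lines of the command quoted earlier could look like the sketch below; `/health_generate` and `/get_model_info` are SGLang's native endpoints, while the port and timing values are assumptions.

```python
# Hypothetical readiness probe: wait until the server can actually generate
# tokens, then print basic model info. Endpoints follow SGLang's native API.
import time

import requests

BASE_URL = "http://127.0.0.1:30000"  # assumed default SGLang port


def wait_until_ready(timeout_s: int = 600) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # /health_generate runs a tiny generation, so a 200 response means
            # the model (including its MoE layers) is actually serving tokens.
            if requests.get(f"{BASE_URL}/health_generate", timeout=10).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(5)
    raise TimeoutError("server did not become ready in time")


wait_until_ready()
print(requests.get(f"{BASE_URL}/get_model_info", timeout=10).json())
```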
Motivation
Modifications
Accuracy Test
Benchmark & Profiling
Checklist