Description
Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
On H100:
============================================================================================================================================
CUDA Kernel Size Analysis
============================================================================================================================================
Total kernels: 10289
Total size: 1633.17 MB (1,712,508,080 bytes)
Average kernel size: 162.54 KB
============================================================================================================================================
Kernel Groups (by name prefix) - Top 200
============================================================================================================================================
Rank Kernel Prefix Count Total (MB) %
--------------------------------------------------------------------------------------------------------------------------------------------
1 void cutlass::device_kernel 3648 651.90 39.92
2 void marlin_moe_wna16::Marlin 720 370.32 22.67
3 void marlin::Marlin 1080 292.63 17.92
4 void fast_hadamard_transform_kernel 294 89.20 5.46
5 void flash::flash_fwd_sparse_kernel 32 11.43 0.70
6 void cutlass::Kernel2 96 11.24 0.69
7 void per_token_group_quant_8bit_kernel 196 8.82 0.54
8 void flashinfer::sampling::TopKTopPSamplingFromProbKernel 40 7.54 0.46
9 void flashinfer::sampling::TopKSamplingFromProbKernel 40 7.50 0.46
10 void router_gemm_kernel_bf16_output 64 6.89 0.42
11 void router_gemm_kernel_float_output 64 6.83 0.42
12 void mscclpp::executionKernel 30 5.48 0.34
13 void moe_fused_gate_kernel 24 3.51 0.22
14 void flashinfer::sampling::OnlineSoftmaxFusedKernel 40 3.44 0.21
15 void topkGatingSigmoid 54 3.43 0.21
16 void flashinfer::sampling::ChainSpeculativeSampling 20 3.33 0.20
17 void flashinfer::sampling::TopPSamplingFromProbKernel 20 3.28 0.20
18 void flashinfer::norm::FusedAddRMSNormKernel 30 3.15 0.19
19 void topkGatingSoftmax 54 3.03 0.19
20 void flashinfer::BatchQKApplyRotaryPosIdsCosSinCacheEnhancedKernel 96 2.92 0.18
On H100, the sgl-kernel wheel is 1633 MB. To reduce the size of the wheel, we need to move as many kernels as possible, everything except attention, to jit_kernel.
Related to #17035
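For anyone who wants to reproduce a per-prefix breakdown like the one above, here is a minimal sketch of the grouping step. It is not the script that produced the table: the helper names (`name_prefix`, `group_by_prefix`) and the sample entries are hypothetical, and it assumes you already have a list of (demangled kernel name, size in bytes) pairs extracted from the compiled library.

```python
from collections import defaultdict

MIB = 1024 * 1024


def name_prefix(kernel_name: str) -> str:
    """Return the part of a demangled kernel name before any template or argument list."""
    cut = len(kernel_name)
    for ch in "<(":
        idx = kernel_name.find(ch)
        if idx != -1:
            cut = min(cut, idx)
    return kernel_name[:cut].strip()


def group_by_prefix(kernels):
    """Aggregate (name, size_in_bytes) pairs into (prefix, count, MB, percent) rows."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for name, size in kernels:
        prefix = name_prefix(name)
        counts[prefix] += 1
        sizes[prefix] += size
    total = sum(sizes.values())
    if total == 0:
        return []
    rows = [
        (prefix, counts[prefix], sizes[prefix] / MIB, 100.0 * sizes[prefix] / total)
        for prefix in sizes
    ]
    # Largest groups first, like the table in this issue.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical sample input (made-up names and sizes, for illustration only).
sample = [
    ("void marlin::Marlin<half, 4, 256>(int const*)", 300_000),
    ("void marlin::Marlin<half, 8, 128>(int const*)", 200_000),
    ("void topkGatingSoftmax<4, 8>(float const*)", 250_000),
]

for rank, (prefix, count, mb, pct) in enumerate(group_by_prefix(sample), start=1):
    print(f"{rank:>4} {prefix:<40} {count:>6} {mb:>10.2f} {pct:>6.2f}")
```

Grouping on the text before the first `<` or `(` collapses all template instantiations of one kernel into a single row, which is why a prefix such as `void cutlass::device_kernel` can account for thousands of kernels.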
Plan
- move marlin fused moe kernel (csrc/moe/marlin_moe_wna16/ops.cu): https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/marlin_moe_wna16/ops.cu
- move csrc/gemm/marlin/gptq_marlin.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/gptq_marlin.cu (≈97.5MB) @celve (PR: [Kernel] Migrate GPTQ-Marlin GEMM kernel to JIT #18067)
- move csrc/gemm/marlin/gptq_marlin_repack.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/gptq_marlin_repack.cu (≈97.5MB) (PR: [Kernel Slimming] Migrate GPTQ-Marlin repack kernel to JIT #18543)
- move csrc/gemm/marlin/awq_marlin_repack.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/awq_marlin_repack.cu (≈97.5MB) (PR: [Kernel Slimming] Migrate AWQ marlin repack kernel to JIT #18949)
- move (external) fast-hadamard-transform/csrc/fast_hadamard_transform_cuda.cu: https://github.com/sgl-project/fast-hadamard-transform/blob/48f3c13764dc2ec662ade842a4696a90a137f1bc/csrc/fast_hadamard_transform_cuda.cu (≈89.2MB) @BBuf (PR: [kernel slimming] Move fast_hadamard_transform to jit_kernel #18475)
- move (external) fast-hadamard-transform/csrc/fast_hadamard_transform.cpp: https://github.com/sgl-project/fast-hadamard-transform/blob/48f3c13764dc2ec662ade842a4696a90a137f1bc/csrc/fast_hadamard_transform.cpp (<0.1MB) @BBuf (PR: [kernel slimming] Move fast_hadamard_transform to jit_kernel #18475)
- move (external) flashinfer/csrc/sampling.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/sampling.cu (≈30MB) @pansicheng (PR: use flashinfer.sampling #18696)
- move (external) flashinfer/csrc/renorm.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/renorm.cu (≈4MB) @Johnsonms (PR: Migrate renorm kernels from sgl-kernel to FlashInfer JIT #18854)
- move (external) flashinfer/csrc/norm.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/norm.cu (≈5MB) @Johnsonms
- move csrc/gemm/dsv3_router_gemm_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_entry.cu (<0.1MB)
- move csrc/gemm/dsv3_router_gemm_bf16_out.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_bf16_out.cu (≈6.9MB)
- move csrc/gemm/dsv3_router_gemm_float_out.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_float_out.cu (≈6.8MB)
- move csrc/gemm/dsv3_fused_a_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_fused_a_gemm.cu (size unknown)
- move csrc/gemm/fp8_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/fp8_gemm_kernel.cu (size unknown)
- move csrc/gemm/fp8_blockwise_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/fp8_blockwise_gemm_kernel.cu (size unknown)
- move csrc/gemm/bmm_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/bmm_fp8.cu (size unknown)
- move csrc/gemm/per_token_group_quant_8bit.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_group_quant_8bit.cu (≈4.4MB) @JayceSu98
- move csrc/gemm/per_token_group_quant_8bit_v2.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_group_quant_8bit_v2.cu (≈4.4MB)
- move csrc/gemm/gptq/gptq_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/gptq/gptq_kernel.cu (≈7.3MB)
- move csrc/allreduce/mscclpp_allreduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/mscclpp_allreduce.cu (≈5.5MB)
- move csrc/moe/moe_fused_gate.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_fused_gate.cu (≈4.6MB)
- move csrc/moe/kimi_k2_moe_fused_gate.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu (≈0.4MB)
- move csrc/moe/moe_topk_sigmoid_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_topk_sigmoid_kernels.cu (≈3.4MB)
- move csrc/moe/moe_topk_softmax_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_topk_softmax_kernels.cu (≈3.0MB)
- move csrc/moe/moe_sum_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_sum_reduce.cu (≈1.0MB)
- move csrc/moe/moe_sum.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_sum.cu (≈0.3MB)
- move csrc/moe/moe_align_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_align_kernel.cu (≈1.1MB)
- move csrc/moe/prepare_moe_input.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/prepare_moe_input.cu (≈0.3MB)
- move csrc/moe/fused_qknorm_rope_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/fused_qknorm_rope_kernel.cu (≈0.9MB)
- move csrc/moe/fp8_blockwise_moe_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/fp8_blockwise_moe_kernel.cu (size unknown)
- move csrc/moe/nvfp4_blockwise_moe.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/nvfp4_blockwise_moe.cu (size unknown)
- move csrc/quantization/gguf/gguf_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/quantization/gguf/gguf_kernel.cu (≈15–25MB)
- move csrc/mamba/causal_conv1d.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/mamba/causal_conv1d.cu (≈3.4MB)
- move csrc/gemm/per_token_quant_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_quant_fp8.cu (≈1.9MB)
- move csrc/gemm/per_tensor_quant_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_tensor_quant_fp8.cu (≈0.3MB)
- move csrc/gemm/int8_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/int8_gemm_kernel.cu (size unknown)
- move csrc/gemm/awq_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/awq_kernel.cu (size unknown)
- move csrc/gemm/qserve_w4a8_per_group_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/qserve_w4a8_per_group_gemm.cu (size unknown)
- move csrc/gemm/qserve_w4a8_per_chn_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/qserve_w4a8_per_chn_gemm.cu (size unknown)
- move csrc/gemm/nvfp4_expert_quant.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_expert_quant.cu (size unknown)
- move csrc/gemm/nvfp4_quant_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_quant_entry.cu (size unknown)
- move csrc/gemm/nvfp4_quant_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_quant_kernels.cu (size unknown)
- move csrc/gemm/nvfp4_scaled_mm_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_scaled_mm_entry.cu (size unknown)
- move csrc/gemm/nvfp4_scaled_mm_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_scaled_mm_kernels.cu (size unknown)
- move csrc/allreduce/custom_all_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/custom_all_reduce.cu (≈2.4MB)
- move csrc/allreduce/quick_all_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/quick_all_reduce.cu (<0.5MB)
- move csrc/kvcacheio/transfer.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/kvcacheio/transfer.cu (≈0.5MB)
- move csrc/speculative/speculative_sampling.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/speculative_sampling.cu (<0.1MB)
- move csrc/speculative/eagle_utils.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/eagle_utils.cu (<0.1MB)
- move csrc/speculative/ngram_utils.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/ngram_utils.cu (<0.1MB)
- move csrc/speculative/packbit.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/packbit.cu (<0.1MB)
- move csrc/grammar/apply_token_bitmask_inplace_cuda.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/grammar/apply_token_bitmask_inplace_cuda.cu (≈0.5MB)
- move csrc/expert_specialization/es_fp8_blockwise.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_fp8_blockwise.cu (size unknown)
- move csrc/expert_specialization/es_sm100_mxfp8_blockscaled.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled.cu (size unknown)
- move csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cu (size unknown)
- move csrc/elementwise/topk.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/topk.cu (size unknown)
- move csrc/elementwise/rope.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/rope.cu (≈0.2MB) @pansicheng (PR: [Kernel] Add JIT apply_rope_with_cos_sin_cache_inplace #18155)
- move csrc/elementwise/fused_add_rms_norm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/fused_add_rms_norm_kernel.cu (size unknown)
- move csrc/elementwise/activation.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/activation.cu (size unknown)
- move csrc/elementwise/cast.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/cast.cu (size unknown)
- move csrc/elementwise/copy.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/copy.cu (size unknown)
- move csrc/elementwise/concat_mla.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/concat_mla.cu (size unknown) @celve (PR: [Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels #17889)
- move csrc/elementwise/pos_enc.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/pos_enc.cu (size unknown) @pansicheng (PR: [Kernel] Add JIT rotary_embedding_kernel #17934)
- move csrc/elementwise/pos_enc.cuh: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/pos_enc.cuh (size unknown) @pansicheng
- move csrc/common_extension.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension.cc (<0.1MB)
- move csrc/common_extension_musa.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension_musa.cc (<0.1MB)
- move csrc/common_extension_rocm.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension_rocm.cc (<0.1MB)
- move csrc/memory/store.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/memory/store.cu (<0.1MB)
- move csrc/memory/weak_ref_tensor.cpp: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/memory/weak_ref_tensor.cpp (<0.1MB)
- move csrc/spatial/greenctx_stream.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/spatial/greenctx_stream.cu (<0.1MB)
- move csrc/spatial_extension.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/spatial_extension.cc (<0.1MB)
- move diffusion time_embed kernel to jit_kernel: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/sgl_diffusion/elementwise/timestep_embedding.cu (<0.1MB) @BBuf (PR: [Diffusion] Delete sgl-kernel outdated time_embedding kernel #17278)
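To help decide which items to pick up first, the size estimates in the plan can be ranked and accumulated against the 1633.17 MB total from the analysis. The sketch below is only an illustration: it lists a subset of the entries with known estimates, the function name `cumulative_savings` is hypothetical, and it assumes the per-file estimates are additive.

```python
TOTAL_MB = 1633.17  # total kernel size reported in the H100 analysis

# (source file, estimated size in MB), copied from the plan; only a subset is listed.
estimates = [
    ("csrc/gemm/dsv3_router_gemm_float_out.cu", 6.8),
    ("csrc/gemm/marlin/gptq_marlin.cu", 97.5),
    ("flashinfer/csrc/sampling.cu", 30.0),
    ("csrc/gemm/marlin/gptq_marlin_repack.cu", 97.5),
    ("csrc/gemm/gptq/gptq_kernel.cu", 7.3),
    ("csrc/gemm/marlin/awq_marlin_repack.cu", 97.5),
    ("fast-hadamard-transform/csrc/fast_hadamard_transform_cuda.cu", 89.2),
    ("csrc/gemm/dsv3_router_gemm_bf16_out.cu", 6.9),
]


def cumulative_savings(items, total_mb):
    """Rank items by size and return (path, MB, running MB, running % of total) rows."""
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    running = 0.0
    rows = []
    for path, size_mb in ranked:
        running += size_mb
        rows.append((path, size_mb, running, 100.0 * running / total_mb))
    return rows


for path, size_mb, running, pct in cumulative_savings(estimates, TOTAL_MB):
    print(f"{size_mb:>7.1f} MB  cumulative {running:>7.1f} MB ({pct:>5.2f}%)  {path}")
```

Entries marked as size unknown are left out here, so the cumulative figure is a lower bound on what the listed files could save.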