[Feature] sgl-kernel wheel slimming plan tracking #17865

### Checklist

- [x] If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.

### Motivation

On H100:

```shell
============================================================================================================================================
CUDA Kernel Size Analysis
============================================================================================================================================

Total kernels: 10289
Total size: 1633.17 MB (1,712,508,080 bytes)
Average kernel size: 162.54 KB

============================================================================================================================================
Kernel Groups (by name prefix) - Top 200
============================================================================================================================================
Rank   Kernel Prefix                                                                    Count    Total (MB)   %       
--------------------------------------------------------------------------------------------------------------------------------------------
1      void cutlass::device_kernel                                                      3648     651.90       39.92   
2      void marlin_moe_wna16::Marlin                                                    720      370.32       22.67   
3      void marlin::Marlin                                                              1080     292.63       17.92   
4      void fast_hadamard_transform_kernel                                              294      89.20        5.46    
5      void flash::flash_fwd_sparse_kernel                                              32       11.43        0.70    
6      void cutlass::Kernel2                                                            96       11.24        0.69    
7      void per_token_group_quant_8bit_kernel                                           196      8.82         0.54    
8      void flashinfer::sampling::TopKTopPSamplingFromProbKernel                        40       7.54         0.46    
9      void flashinfer::sampling::TopKSamplingFromProbKernel                            40       7.50         0.46    
10     void router_gemm_kernel_bf16_output                                              64       6.89         0.42    
11     void router_gemm_kernel_float_output                                             64       6.83         0.42    
12     void mscclpp::executionKernel                                                    30       5.48         0.34    
13     void moe_fused_gate_kernel                                                       24       3.51         0.22    
14     void flashinfer::sampling::OnlineSoftmaxFusedKernel                              40       3.44         0.21    
15     void topkGatingSigmoid                                                           54       3.43         0.21    
16     void flashinfer::sampling::ChainSpeculativeSampling                              20       3.33         0.20    
17     void flashinfer::sampling::TopPSamplingFromProbKernel                            20       3.28         0.20    
18     void flashinfer::norm::FusedAddRMSNormKernel                                     30       3.15         0.19    
19     void topkGatingSoftmax                                                           54       3.03         0.19    
20     void flashinfer::BatchQKApplyRotaryPosIdsCosSinCacheEnhancedKernel               96       2.92         0.18    
```


On H100, the kernels compiled into the sgl-kernel wheel total 1633 MB. To shrink the wheel, we should move as many kernels as possible, except for attention, into jit_kernel.
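To put the table in perspective, the short sketch below adds up the four largest kernel groups and reports their share of the total. The figures are copied from the analysis output above; the function name `share_of_total` is only illustrative and not part of any sgl-kernel tooling.

```python
# Sizes in MB, copied from the "Kernel Groups" table above.
TOTAL_MB = 1633.17
top_groups = {
    "cutlass::device_kernel": 651.90,
    "marlin_moe_wna16::Marlin": 370.32,
    "marlin::Marlin": 292.63,
    "fast_hadamard_transform_kernel": 89.20,
}


def share_of_total(groups, total_mb):
    """Return (combined size in MB, percentage of the total)."""
    combined = sum(groups.values())
    return combined, 100.0 * combined / total_mb


combined_mb, pct = share_of_total(top_groups, TOTAL_MB)
print(f"top {len(top_groups)} groups: {combined_mb:.2f} MB ({pct:.2f}% of total)")
```

These four groups account for roughly 86% of the measured kernel bytes, which is why the Marlin and Hadamard items in the plan carry most of the potential savings.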

Related to https://github.com/sgl-project/sglang/issues/17035


### Plan

- [ ] move marlin fused moe kernel (csrc/moe/marlin_moe_wna16/ops.cu): https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/marlin_moe_wna16/ops.cu 

- [x] move csrc/gemm/marlin/gptq_marlin.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/gptq_marlin.cu (≈97.5MB) @celve https://github.com/sgl-project/sglang/pull/18067
- [x] move csrc/gemm/marlin/gptq_marlin_repack.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/gptq_marlin_repack.cu (≈97.5MB) https://github.com/sgl-project/sglang/pull/18543
- [ ] move csrc/gemm/marlin/awq_marlin_repack.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/marlin/awq_marlin_repack.cu (≈97.5MB) https://github.com/sgl-project/sglang/pull/18949

- [x] move (external) fast-hadamard-transform/csrc/fast_hadamard_transform_cuda.cu: https://github.com/sgl-project/fast-hadamard-transform/blob/48f3c13764dc2ec662ade842a4696a90a137f1bc/csrc/fast_hadamard_transform_cuda.cu (≈89.2MB) https://github.com/sgl-project/sglang/pull/18475 @BBuf 
- [x] move (external) fast-hadamard-transform/csrc/fast_hadamard_transform.cpp: https://github.com/sgl-project/fast-hadamard-transform/blob/48f3c13764dc2ec662ade842a4696a90a137f1bc/csrc/fast_hadamard_transform.cpp (≈<0.1MB) https://github.com/sgl-project/sglang/pull/18475 @BBuf 

- [ ] move (external) flashinfer/csrc/sampling.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/sampling.cu (≈30MB) https://github.com/sgl-project/sglang/pull/18696 @pansicheng 
- [ ] move (external) flashinfer/csrc/renorm.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/renorm.cu (≈4MB) @Johnsonms https://github.com/sgl-project/sglang/pull/18854
- [ ] move (external) flashinfer/csrc/norm.cu: https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/csrc/norm.cu (≈5MB) @Johnsonms 

- [ ] move csrc/gemm/dsv3_router_gemm_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_entry.cu (≈<0.1MB)
- [ ] move csrc/gemm/dsv3_router_gemm_bf16_out.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_bf16_out.cu (≈6.9MB)
- [ ] move csrc/gemm/dsv3_router_gemm_float_out.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_router_gemm_float_out.cu (≈6.8MB)
- [ ] move csrc/gemm/dsv3_fused_a_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/dsv3_fused_a_gemm.cu (≈unknown)

- [ ] move csrc/gemm/fp8_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/fp8_gemm_kernel.cu (≈unknown)
- [ ] move csrc/gemm/fp8_blockwise_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/fp8_blockwise_gemm_kernel.cu (≈unknown)
- [ ] move csrc/gemm/bmm_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/bmm_fp8.cu (≈unknown)

- [ ] move csrc/gemm/per_token_group_quant_8bit.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_group_quant_8bit.cu (≈4.4MB) @JayceSu98 https://github.com/JayceSu98
- [ ] move csrc/gemm/per_token_group_quant_8bit_v2.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_group_quant_8bit_v2.cu (≈4.4MB)

- [ ] move csrc/gemm/gptq/gptq_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/gptq/gptq_kernel.cu (≈7.3MB)

- [ ] move csrc/allreduce/mscclpp_allreduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/mscclpp_allreduce.cu (≈5.5MB)

- [ ] move csrc/moe/moe_fused_gate.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_fused_gate.cu (≈4.6MB)
- [ ] move csrc/moe/kimi_k2_moe_fused_gate.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu (≈0.4MB)
- [ ] move csrc/moe/moe_topk_sigmoid_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_topk_sigmoid_kernels.cu (≈3.4MB)
- [ ] move csrc/moe/moe_topk_softmax_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_topk_softmax_kernels.cu (≈3.0MB)
- [ ] move csrc/moe/moe_sum_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_sum_reduce.cu (≈1.0MB)
- [ ] move csrc/moe/moe_sum.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_sum.cu (≈0.3MB)
- [ ] move csrc/moe/moe_align_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/moe_align_kernel.cu (≈1.1MB)
- [ ] move csrc/moe/prepare_moe_input.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/prepare_moe_input.cu (≈0.3MB)
- [ ] move csrc/moe/fused_qknorm_rope_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/fused_qknorm_rope_kernel.cu (≈0.9MB)
- [ ] move csrc/moe/fp8_blockwise_moe_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/fp8_blockwise_moe_kernel.cu (≈unknown)
- [ ] move csrc/moe/nvfp4_blockwise_moe.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/moe/nvfp4_blockwise_moe.cu (≈unknown)

- [ ] move csrc/quantization/gguf/gguf_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/quantization/gguf/gguf_kernel.cu (≈15–25MB)

- [ ] move csrc/mamba/causal_conv1d.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/mamba/causal_conv1d.cu (≈3.4MB)

- [ ] move csrc/gemm/per_token_quant_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_token_quant_fp8.cu (≈1.9MB)
- [ ] move csrc/gemm/per_tensor_quant_fp8.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/per_tensor_quant_fp8.cu (≈0.3MB)
- [ ] move csrc/gemm/int8_gemm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/int8_gemm_kernel.cu (≈unknown)
- [ ] move csrc/gemm/awq_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/awq_kernel.cu (≈unknown)
- [ ] move csrc/gemm/qserve_w4a8_per_group_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/qserve_w4a8_per_group_gemm.cu (≈unknown)
- [ ] move csrc/gemm/qserve_w4a8_per_chn_gemm.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/qserve_w4a8_per_chn_gemm.cu (≈unknown)
- [ ] move csrc/gemm/nvfp4_expert_quant.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_expert_quant.cu (≈unknown)
- [ ] move csrc/gemm/nvfp4_quant_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_quant_entry.cu (≈unknown)
- [ ] move csrc/gemm/nvfp4_quant_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_quant_kernels.cu (≈unknown)
- [ ] move csrc/gemm/nvfp4_scaled_mm_entry.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_scaled_mm_entry.cu (≈unknown)
- [ ] move csrc/gemm/nvfp4_scaled_mm_kernels.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/gemm/nvfp4_scaled_mm_kernels.cu (≈unknown)

- [ ] move csrc/allreduce/custom_all_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/custom_all_reduce.cu (≈2.4MB)
- [ ] move csrc/allreduce/quick_all_reduce.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/allreduce/quick_all_reduce.cu (≈<0.5MB)

- [ ] move csrc/kvcacheio/transfer.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/kvcacheio/transfer.cu (≈0.5MB)
- [ ] move csrc/speculative/speculative_sampling.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/speculative_sampling.cu (≈<0.1MB)
- [ ] move csrc/speculative/eagle_utils.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/eagle_utils.cu (≈<0.1MB)
- [ ] move csrc/speculative/ngram_utils.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/ngram_utils.cu (≈<0.1MB)
- [ ] move csrc/speculative/packbit.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/speculative/packbit.cu (≈<0.1MB)
- [ ] move csrc/grammar/apply_token_bitmask_inplace_cuda.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/grammar/apply_token_bitmask_inplace_cuda.cu (≈0.5MB)

- [ ] move csrc/expert_specialization/es_fp8_blockwise.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_fp8_blockwise.cu (≈unknown)
- [ ] move csrc/expert_specialization/es_sm100_mxfp8_blockscaled.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled.cu (≈unknown)
- [ ] move csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cu (≈unknown)

- [ ] move csrc/elementwise/topk.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/topk.cu (≈unknown)
- [x] move csrc/elementwise/rope.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/rope.cu (≈0.2MB) @pansicheng https://github.com/sgl-project/sglang/pull/18155
- [ ] move csrc/elementwise/fused_add_rms_norm_kernel.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/fused_add_rms_norm_kernel.cu (≈unknown)
- [ ] move csrc/elementwise/activation.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/activation.cu (≈unknown)
- [ ] move csrc/elementwise/cast.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/cast.cu (≈unknown)
- [ ] move csrc/elementwise/copy.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/copy.cu (≈unknown)
- [x] move csrc/elementwise/concat_mla.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/concat_mla.cu (≈unknown) https://github.com/sgl-project/sglang/pull/17889 @celve
- [ ] move csrc/elementwise/pos_enc.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/pos_enc.cu (≈unknown) https://github.com/sgl-project/sglang/pull/17934 @pansicheng 
- [ ] move csrc/elementwise/pos_enc.cuh: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/elementwise/pos_enc.cuh (≈unknown) @pansicheng 

- [ ] move csrc/common_extension.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension.cc (≈<0.1MB)
- [ ] move csrc/common_extension_musa.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension_musa.cc (≈<0.1MB)
- [ ] move csrc/common_extension_rocm.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension_rocm.cc (≈<0.1MB)
- [ ] move csrc/memory/store.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/memory/store.cu (≈<0.1MB)
- [ ] move csrc/memory/weak_ref_tensor.cpp: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/memory/weak_ref_tensor.cpp (≈<0.1MB)
- [ ] move csrc/spatial/greenctx_stream.cu: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/spatial/greenctx_stream.cu (≈<0.1MB)
- [ ] move csrc/spatial_extension.cc: https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/spatial_extension.cc (≈<0.1MB)
- [x] move csrc/sgl_diffusion/elementwise/timestep_embedding.cu (diffusion time_embed kernel): https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/sgl_diffusion/elementwise/timestep_embedding.cu (≈<0.1MB) https://github.com/sgl-project/sglang/pull/17278 @BBuf
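
Progress on the checklist above can be tallied with a small script. This is a minimal sketch, assuming each item keeps the `- [ ]` / `- [x]` prefix and a `(≈<size>)` annotation as in this issue. The names `summarize`, `ITEM_RE`, and `NUMERIC_RE` are hypothetical, and the sample lines use `<url>` as a stand-in for the real links. Only plain numeric estimates are summed; entries such as `<0.1MB`, `unknown`, or ranges are counted as unsized rather than guessed.

```python
import re

# Matches task-list lines such as "- [x] move path: <url> (≈97.5MB) ...".
ITEM_RE = re.compile(r"^- \[(?P<done>[ x])\] .*\(≈(?P<size>[^)]*)\)")
# Accept only plain numeric estimates like "97.5MB".
NUMERIC_RE = re.compile(r"^(\d+(?:\.\d+)?)MB$")


def summarize(markdown):
    """Sum estimated sizes (MB) of done and pending checklist items."""
    totals = {"done_mb": 0.0, "pending_mb": 0.0, "unsized": 0}
    for line in markdown.splitlines():
        item = ITEM_RE.match(line.strip())
        if item is None:
            continue  # not a checklist item with a size annotation
        numeric = NUMERIC_RE.match(item.group("size"))
        if numeric is None:
            totals["unsized"] += 1  # "<0.1MB", "unknown", ranges, ...
            continue
        key = "done_mb" if item.group("done") == "x" else "pending_mb"
        totals[key] += float(numeric.group(1))
    return totals


# Shortened sample lines; "<url>" replaces the real links.
sample = """\
- [x] move csrc/gemm/marlin/gptq_marlin.cu: <url> (≈97.5MB)
- [ ] move csrc/gemm/gptq/gptq_kernel.cu: <url> (≈7.3MB)
- [ ] move csrc/gemm/bmm_fp8.cu: <url> (≈unknown)
- [x] move csrc/elementwise/rope.cu: <url> (≈0.2MB)
"""
print(summarize(sample))
```

Because the per-file sizes in the plan are estimates, the totals are best read as a rough progress indicator rather than the exact reduction in wheel size.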


### Related resources

https://github.com/sgl-project/sglang/issues/17035


