[Feature] sgl-kernel wheel slimming plan tracking #17865

@BBuf

Checklist

Motivation

On H100:

============================================================================================================================================
CUDA Kernel Size Analysis
============================================================================================================================================

Total kernels: 10289
Total size: 1633.17 MB (1,712,508,080 bytes)
Average kernel size: 162.54 KB

============================================================================================================================================
Kernel Groups (by name prefix) - Top 200
============================================================================================================================================
Rank   Kernel Prefix                                                                    Count    Total (MB)   %       
--------------------------------------------------------------------------------------------------------------------------------------------
1      void cutlass::device_kernel                                                      3648     651.90       39.92   
2      void marlin_moe_wna16::Marlin                                                    720      370.32       22.67   
3      void marlin::Marlin                                                              1080     292.63       17.92   
4      void fast_hadamard_transform_kernel                                              294      89.20        5.46    
5      void flash::flash_fwd_sparse_kernel                                              32       11.43        0.70    
6      void cutlass::Kernel2                                                            96       11.24        0.69    
7      void per_token_group_quant_8bit_kernel                                           196      8.82         0.54    
8      void flashinfer::sampling::TopKTopPSamplingFromProbKernel                        40       7.54         0.46    
9      void flashinfer::sampling::TopKSamplingFromProbKernel                            40       7.50         0.46    
10     void router_gemm_kernel_bf16_output                                              64       6.89         0.42    
11     void router_gemm_kernel_float_output                                             64       6.83         0.42    
12     void mscclpp::executionKernel                                                    30       5.48         0.34    
13     void moe_fused_gate_kernel                                                       24       3.51         0.22    
14     void flashinfer::sampling::OnlineSoftmaxFusedKernel                              40       3.44         0.21    
15     void topkGatingSigmoid                                                           54       3.43         0.21    
16     void flashinfer::sampling::ChainSpeculativeSampling                              20       3.33         0.20    
17     void flashinfer::sampling::TopPSamplingFromProbKernel                            20       3.28         0.20    
18     void flashinfer::norm::FusedAddRMSNormKernel                                     30       3.15         0.19    
19     void topkGatingSoftmax                                                           54       3.03         0.19    
20     void flashinfer::BatchQKApplyRotaryPosIdsCosSinCacheEnhancedKernel               96       2.92         0.18    

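On H100, the sgl-kernel wheel contains about 1633 MB of CUDA kernels (10,289 kernels in total). The three largest groups (cutlass::device_kernel, marlin_moe_wna16::Marlin, and marlin::Marlin) account for roughly 80% of that size. To shrink the wheel, we need to move as many kernels as possible, everything except attention, to jit_kernel.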
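To help prioritize which groups to move first, the report rows can be turned into a cumulative share. The snippet below is only a rough sketch: it assumes the row layout shown in the report above (rank, prefix, count, total MB, percent), and `parse_rows` and `cumulative_share` are hypothetical helper names, not part of sgl-kernel or of the tool that produced the report.

```python
import re

# First five rows copied from the report above.
REPORT_ROWS = """\
1      void cutlass::device_kernel                    3648     651.90       39.92
2      void marlin_moe_wna16::Marlin                  720      370.32       22.67
3      void marlin::Marlin                            1080     292.63       17.92
4      void fast_hadamard_transform_kernel            294      89.20        5.46
5      void flash::flash_fwd_sparse_kernel            32       11.43        0.70
"""

# Assumed layout: rank, kernel prefix (may contain spaces), count, MB, percent.
ROW_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s*$")


def parse_rows(text):
    """Parse report rows into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rank, prefix, count, mb, pct = m.groups()
            rows.append({
                "rank": int(rank),
                "prefix": prefix,
                "count": int(count),
                "mb": float(mb),
                "pct": float(pct),
            })
    return rows


def cumulative_share(rows):
    """Return (prefix, cumulative percent) pairs in rank order."""
    running = 0.0
    out = []
    for row in sorted(rows, key=lambda r: r["rank"]):
        running += row["pct"]
        out.append((row["prefix"], round(running, 2)))
    return out


if __name__ == "__main__":
    for prefix, share in cumulative_share(parse_rows(REPORT_ROWS)):
        print(f"{share:6.2f}%  {prefix}")
```

With the rows above, the first three groups add up to 39.92 + 22.67 + 17.92 = 80.51% of the total, so the CUTLASS and Marlin kernels are where most of the size sits.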

Related to #17035

Plan

Related resources

#17035
