Checklist
Motivation
The w8a8 GEMM kernel and the fused MoE kernel do not perform well enough on B200. There is room for further tuning.
Related resources
Reproduction on a node with 8 B200 GPUs:
python3 -m sglang.bench_one_batch --model-path /dev/shm/DeepSeek-V3 --tp 8 --batch 16 --input-len 1024 --output-len 128 --attention-backend triton --profile