[Feature] Tune fp8 Gemm and fused moe kernel on B200 #6095

### Checklist

- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.

### Motivation

The w8a8 FP8 GEMM kernel and the fused MoE kernel do not perform well enough on B200, and there is room to tune both.

### Related resources

Reproduction on 8x B200:
```bash
python3 -m sglang.bench_one_batch --model-path /dev/shm/DeepSeek-V3 --tp 8 --batch 16 --input-len 1024 --output-len 128 --attention-backend triton --profile
```

