## Checklist

- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
## Motivation

Share our optimization methods for Qwen.
## Qwen3-32B

### Decode optimization
#### Fused OPs

| OPs | Ascend OPs | Location |
| --- | --- | --- |
| RMSNorm | `torch_npu.npu_rms_norm`, `torch_npu.npu_add_rms_norm` | `layernorm.py::RMSNorm::forward_npu` |
| RoPE | `torch_npu.npu_mrope` (including cos_sin_cache) | `layernorm.py::RotaryEmbedding::forward_npu` |
| RadixAttention | prefill: `torch_npu._npu_flash_attention_qlens`; decode: `torch_npu._npu_paged_attention` | `ascend_backend.py::AscendAttnBackend::forward_extend`, `forward_decode` |
| SiluAndMul | `torch_npu.npu_swiglu` | `activation.py::SiluAndMul::forward_npu` |
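As an illustration of how such a fused op slots in, here is a minimal sketch of an RMSNorm forward pass built on `torch_npu.npu_rms_norm` / `torch_npu.npu_add_rms_norm`. The eps value, shapes, and return-value handling are assumptions based on the public torch_npu API, not the exact code in `layernorm.py::RMSNorm::forward_npu`.

```python
import torch
import torch_npu


def rms_norm_forward_npu(x: torch.Tensor, weight: torch.Tensor,
                         eps: float = 1e-6, residual: torch.Tensor = None):
    """Fused RMSNorm on Ascend, with an optional fused residual add."""
    if residual is None:
        # npu_rms_norm returns (normalized_output, rstd); only the output is used here.
        out, _ = torch_npu.npu_rms_norm(x, weight, eps)
        return out
    # npu_add_rms_norm fuses (x + residual) with the normalization and
    # returns (normalized_output, rstd, x_plus_residual).
    out, _, new_residual = torch_npu.npu_add_rms_norm(x, residual, weight, eps)
    return out, new_residual
```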
#### Summary of other key features

**W8A8 quantization**

**Enable ACLGraph**
The aim is to improve the performance of the launch task during the decode phase. See issue 8030.
Notice:

- The `actual_seq_lengths` argument of `torch_npu.npu_fused_infer_attention_score` needs special handling during `npu_graph_runner.py::NPUGraphRunner::replay`. The background is that IFA's tiling depends on the host value of `actual_seq_len`, which is the origin of the issue (see the sketch after this list).
- We found that `torch_npu._npu_paged_attention` is currently a bit faster than `torch_npu.npu_fused_infer_attention_score`, but it could not be captured into ACLGraph, so after discussing with the torch_npu team we submitted PR24572 to support it; see PR24572 for more details. In this case, the `context_lens` argument needs special handling.
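Below is a minimal sketch of the capture-and-replay pattern behind ACLGraph, assuming torch_npu mirrors the CUDA Graph API under `torch.npu` (`NPUGraph`, `torch.npu.graph`, `replay()`); the buffer names and the model wrapper are illustrative, not sglang's actual `NPUGraphRunner`. The point of the note above is that device tensors can be refreshed in place before replay, whereas host-side arguments such as `actual_seq_lengths` are consumed at capture time for tiling and therefore need special handling.

```python
import torch
import torch_npu


def capture_decode_graph(model, batch_size, hidden_size, dtype=torch.float16):
    # Static buffers: replay always reads/writes these exact memory addresses.
    static_input = torch.zeros(batch_size, hidden_size, device="npu", dtype=dtype)
    graph = torch.npu.NPUGraph()
    with torch.npu.graph(graph):
        static_output = model(static_input)
    return graph, static_input, static_output


def replay_decode_graph(graph, static_input, static_output, new_input):
    # Device-side inputs can be updated in place before replay ...
    static_input.copy_(new_input)
    # ... but host-side values (e.g. actual_seq_lengths of
    # npu_fused_infer_attention_score) were read at capture time for tiling,
    # which is why NPUGraphRunner.replay needs the special handling noted above.
    graph.replay()
    return static_output
```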
**[CMO weight prefetch] Prefetch the weight of matmul when running the AIV kernels**

Use `torch_npu.npu_prefetch` to prefetch the weight of the matmul (gate_up, down proj) while other AIV kernels are running, aiming to overlap the memory-access time. Weight prefetching preloads the right-hand matrix of the matmul (gate_up, down proj) into the L2 cache before the matmul is computed, which reduces the memory-access overhead during kernel execution and shortens the matmul execution time.
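A hedged sketch of this pattern, assuming the documented `torch_npu.npu_prefetch(input, dependency, max_size)` interface; the MLP layer names and the 24 MB prefetch cap are illustrative, not the actual sglang code.

```python
import torch
import torch_npu

# Illustrative cap on how many bytes of the weight are pulled into L2 cache.
PREFETCH_MAX_BYTES = 24 * 1024 * 1024


def mlp_forward_with_prefetch(hidden_states, gate_up_proj, down_proj, act_fn):
    # Issue an asynchronous prefetch of the down_proj weight (the right-hand
    # matrix of the next matmul) so that its memory traffic overlaps with the
    # gate_up matmul and the AIV activation kernel below.
    torch_npu.npu_prefetch(down_proj.weight, hidden_states, PREFETCH_MAX_BYTES)
    x = gate_up_proj(hidden_states)
    x = act_fn(x)          # AIV kernel; the prefetch overlaps with this work
    return down_proj(x)
```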
### Prefill optimization -- sequence parallelism

See #10519 for details.

### Host optimization

Will be updated later.
## Qwen2.5-VL -- Vision part only

Please see #9189, #10556, #11047.
**VisionAttention**

Use `torch_npu._npu_flash_attention_unpad` for attention acceleration, including the sin/cos cache and `torch_npu.npu_rotary_mul`.
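A minimal sketch of applying the cached rotary embedding with `torch_npu.npu_rotary_mul`; the tensor shapes and broadcasting are assumptions, and the `_npu_flash_attention_unpad` call itself is omitted since it is an internal op.

```python
import torch
import torch_npu


def apply_vision_rope(q: torch.Tensor, k: torch.Tensor,
                      cos: torch.Tensor, sin: torch.Tensor):
    # q, k: e.g. [1, seq_len, num_heads, head_dim]; cos/sin come from the
    # precomputed cache and broadcast over the head dimension.
    q = torch_npu.npu_rotary_mul(q, cos, sin)
    k = torch_npu.npu_rotary_mul(k, cos, sin)
    return q, k
```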
**VisionPatchEmbed**

Use `matmul` instead of `Conv3D` for patch acceleration. How does it work?

- `feature_map.shape = (N, C, D, H, W)`
- kernel (weight) shape: `(Cout, C, d, h, w)`
- For Qwen2.5-VL: `kernel_size == stride`, and `D = d`, `H = h`, `W = w`
- Number of sliding-window positions: `S = (D/d) * (H/h) * (W/w) = 1`
- Conv3D result: `Hidden_state = (N, S, C*d*h*w) x (Cout, C*d*h*w)^T = (N, 1, C*d*h*w) x (Cout, C*d*h*w)^T`
- which equals: `Hidden_state = (N, C*d*h*w) x (Cout, C*d*h*w)^T`
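The rewrite can be checked numerically with plain PyTorch (CPU is fine); the concrete sizes below are illustrative, chosen so that the kernel covers the whole input as in Qwen2.5-VL's patch embedding.

```python
import torch

N, C, D, H, W = 4, 3, 2, 14, 14        # illustrative patch-sized inputs
Cout, d, h, w = 1280, 2, 14, 14        # kernel == stride, so D=d, H=h, W=w

x = torch.randn(N, C, D, H, W)
conv = torch.nn.Conv3d(C, Cout, kernel_size=(d, h, w), stride=(d, h, w), bias=False)

# Conv3D path: a single sliding-window position (S = 1), output [N, Cout, 1, 1, 1].
ref = conv(x).reshape(N, Cout)

# Matmul path: flatten each sample to (C*d*h*w,) and multiply by the flattened kernel.
weight = conv.weight.reshape(Cout, C * d * h * w)   # (Cout, C*d*h*w)
out = x.reshape(N, C * d * h * w) @ weight.t()      # (N, Cout)

torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```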
**VisionTransformer**

Attention padding: because the Ascend Cube unit reaches its highest performance when the input shape is divisible by 16, we pad the attention head shape from [40, 40] to [64, 64] in `Qwen2_5_VLForConditionalGeneration.load_weights`.
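A hedged, illustrative sketch of the zero-padding idea at weight-load time, assuming the padding targets a head-related weight dimension of size 40; the tensor layout and the helper below are assumptions, not the actual `load_weights` logic.

```python
import torch
import torch.nn.functional as F


def pad_last_dim(weight: torch.Tensor, padded_size: int = 64) -> torch.Tensor:
    """Zero-pad the trailing dimension (e.g. 40 -> 64) so the Cube unit sees aligned shapes."""
    pad = padded_size - weight.shape[-1]
    return F.pad(weight, (0, pad)) if pad > 0 else weight


# Illustrative: a [num_heads, 40] slice of an attention weight becomes [num_heads, 64].
w = torch.randn(16, 40)
print(pad_last_dim(w).shape)  # torch.Size([16, 64])
```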