
[Experience Sharing] How to optimize Qwen on Ascend #10337

@ping1jing2


Motivation

Share our optimization methods for Qwen

Qwen3-32B

  • decode optimization

  • prefill optimization -- sequence parallelism
    See #10519 for details.

  • host optimization
    Will be updated later.

Qwen2.5-VL -- only the Vision part

Please see #9189, #10556, and #11047.

  • VisionAttention
    Use torch_npu._npu_flash_attention_unpad to accelerate attention, together with a cached sin/cos table and torch_npu.npu_rotary_mul for the rotary position embedding (a sketch of the sin/cos cache follows after this list).

  • VisionPatchEmbed
    Use a matmul instead of Conv3D to accelerate the patch embedding (see the equivalence check after this list).

    How does it work?
    feature_map.shape = (N, C, D, H, W)
    conv weight shape = (Cout, C, d, h, w)
    For Qwen2.5-VL, kernel_size == stride and D = d, H = h, W = w,
    so the number of sliding-window positions is S = (D/d) * (H/h) * (W/w) = 1.
    Conv3D result: hidden_state = (N, S, C*d*h*w) x (Cout, C*d*h*w)^T = (N, 1, C*d*h*w) x (Cout, C*d*h*w)^T
    which reduces to: hidden_state = (N, C*d*h*w) x (Cout, C*d*h*w)^T

  • VisionTransformer

    Attention padding

    Because the Ascend Cube unit delivers its highest performance when the input shape is divisible by 16, we pad the attn_head from [40, 40] to [64, 64] in Qwen2_5_VLForConditionalGeneration.load_weights (see the padding sketch after this list).
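
Below is a minimal sketch of the sin/cos-cache idea behind the VisionAttention item above. The class name, shapes, and the plain-PyTorch rotary math are illustrative assumptions, not the actual vllm-ascend code; on Ascend the elementwise rotary math is what torch_npu.npu_rotary_mul fuses into a single kernel, and the attention call itself goes through torch_npu._npu_flash_attention_unpad.

```python
import torch


class VisionRotaryCache:
    """Cache sin/cos tables so they are not rebuilt on every forward pass.

    Hypothetical helper for illustration; dims and dtype handling are assumptions.
    """

    def __init__(self, head_dim: int, theta: float = 10000.0):
        self.inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self._cached_len = 0
        self._cos = None
        self._sin = None

    def get(self, seq_len: int, device, dtype):
        # Rebuild the cache only when a longer sequence is seen.
        if seq_len > self._cached_len:
            t = torch.arange(seq_len, device=device).float()
            freqs = torch.outer(t, self.inv_freq.to(device))
            emb = torch.cat((freqs, freqs), dim=-1)
            self._cos = emb.cos().to(dtype)
            self._sin = emb.sin().to(dtype)
            self._cached_len = seq_len
        return self._cos[:seq_len], self._sin[:seq_len]


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Plain-PyTorch rotary embedding; this elementwise math is what a fused
    # rotary kernel (torch_npu.npu_rotary_mul on Ascend) replaces.
    return x * cos + rotate_half(x) * sin


# Usage on (seq_len, num_heads, head_dim) tensors:
cache = VisionRotaryCache(head_dim=64)
q = torch.randn(128, 16, 64)
cos, sin = cache.get(q.shape[0], q.device, q.dtype)
q = apply_rotary(q, cos[:, None, :], sin[:, None, :])
```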
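Next, a small self-contained check of the Conv3D-to-matmul equivalence described under VisionPatchEmbed. The shapes (C = 3, a 2x14x14 patch) mirror a typical vision patch embed but are illustrative; Cout is kept small so the demo runs quickly.

```python
import torch
import torch.nn as nn

# Illustrative shapes: each input is exactly one patch, so the convolution has a
# single sliding-window position (S = 1). Cout is kept small for the demo.
N, C, D, H, W = 4, 3, 2, 14, 14
Cout = 8

conv = nn.Conv3d(C, Cout, kernel_size=(D, H, W), stride=(D, H, W), bias=False)
x = torch.randn(N, C, D, H, W)

# Conv3D path: output is (N, Cout, 1, 1, 1) because kernel_size == stride == input size.
y_conv = conv(x).reshape(N, Cout)

# Matmul path: flatten each patch to (N, C*d*h*w) and the kernel to (Cout, C*d*h*w),
# then hidden_state = x_flat @ w_flat^T, exactly the formula above.
x_flat = x.reshape(N, C * D * H * W)
w_flat = conv.weight.reshape(Cout, C * D * H * W)
y_matmul = x_flat @ w_flat.T

assert torch.allclose(y_conv, y_matmul, atol=1e-5)
```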
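Finally, a minimal sketch of the attention-padding idea from the VisionTransformer item: zero-padding a projection weight's per-head dimension up to a multiple of 16. The helper name, tensor layout, and shapes are assumptions for illustration; in the issue the actual padding is done inside Qwen2_5_VLForConditionalGeneration.load_weights.

```python
import torch
import torch.nn.functional as F


def pad_head_dim(weight: torch.Tensor, num_heads: int, pad_to: int = 64) -> torch.Tensor:
    """Zero-pad the per-head dimension of a (num_heads * head_dim, in_features)
    projection weight up to `pad_to`.

    Hypothetical helper; in the issue the padding is done while loading weights in
    Qwen2_5_VLForConditionalGeneration.load_weights.
    """
    out_features, in_features = weight.shape
    head_dim = out_features // num_heads
    if head_dim >= pad_to:
        return weight
    w = weight.reshape(num_heads, head_dim, in_features)
    # Pad the head_dim axis with zeros: (heads, head_dim, in) -> (heads, pad_to, in).
    w = F.pad(w, (0, 0, 0, pad_to - head_dim))
    return w.reshape(num_heads * pad_to, in_features)


# Example: head_dim 40 is padded to 64, i.e. [40, 40] -> [64, 64] as in the note above.
num_heads, head_dim, hidden = 16, 40, 640
q_proj = torch.randn(num_heads * head_dim, hidden)
q_proj_padded = pad_head_dim(q_proj, num_heads)
assert q_proj_padded.shape == (num_heads * 64, hidden)
```

Because the padded positions are zero, they contribute nothing to the Q·K dot products, so the attention output is unchanged while the Cube unit sees 16-aligned shapes.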

Qwen3-30B-A3B

See #12078.
