[NPU] ACLGraph Compilation support and PassManager with AddRmsNorm & Quantize fuse. TorchAir compiler backend support. #11104
Conversation
| help="Enable debug mode for torch compile", | ||
| ) | ||
| parser.add_argument( | ||
| "--enable-npu-torchair-compile", |
Can we reuse --enable-torch-compile and set specific configs via args like --torch-compile-config?
Thanks for the comment. TorchAir for NPU can be enabled with --enable-npu-torchair-compile and the --compilation-config option:
--compilation-config {"compiler": "npugraph_ex"} --disable-overlap-schedule
The separate command-line argument --enable-npu-torchair-compile was discussed with @yuan-luo offline; we decided to have a separate argument for each option.
python/sglang/srt/server_args.py
| "--compilation-config", | ||
| type=str, | ||
| default=None, | ||
| help="Compilation config.", |
Rename to --torch-compile-config? Can we make the description clearer?
Thanks for your comment. The description was extended.
Please note: --compilation-config represents a JSON-serialized instance of the CompilationConfig class, which was created earlier: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/compilation/compilation_config.py. I can rename it if you really need, but then it will differ from the class name. Should I rename the CompilationConfig class too?
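For illustration, a minimal sketch of how such a JSON value could map onto a config object; the real class is the one linked above, and any field beyond `compiler` (taken from this discussion) is an assumption:

```python
# Sketch only: the actual class lives in
# python/sglang/srt/compilation/compilation_config.py and may differ.
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompilationConfig:
    compiler: Optional[str] = None  # e.g. "npugraph_ex", as used in this discussion

    @classmethod
    def from_json(cls, raw: str) -> "CompilationConfig":
        # --compilation-config '{"compiler": "npugraph_ex"}' is parsed like this
        return cls(**json.loads(raw))


cfg = CompilationConfig.from_json('{"compiler": "npugraph_ex"}')
print(cfg.compiler)  # npugraph_ex
```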
hidden_states = tensor_model_parallel_all_reduce(hidden_states)
if _is_npu and context.cache is not None:
    _ = prepare_weight_cache(hidden_states, context.cache)
    _ = torch.ops.sglang.prepare_weight_cache(
We are against the use of torch.ops.sglang. Please use the new API.
Done, thanks for the comment.
def __init__(
    self,
    capture_sizes: List[int],
    capture_sizes: List[int] = [],
Using [] as a default argument is very bad practice: the default list is shared and will be changed in place! (See the illustration after this exchange.)
Removed, thanks for the comment.
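A standalone illustration of the pitfall flagged above (not code from this PR): the default list object is created once and then shared across every call that relies on it.

```python
from typing import List, Optional


def add_size(size: int, capture_sizes: List[int] = []) -> List[int]:
    capture_sizes.append(size)  # mutates the single shared default list
    return capture_sizes


print(add_size(1))  # [1]
print(add_size(2))  # [1, 2]  <- state leaked from the previous call


# The usual fix: default to None and create a fresh list inside the function.
def add_size_fixed(size: int, capture_sizes: Optional[List[int]] = None) -> List[int]:
    sizes = [] if capture_sizes is None else capture_sizes
    sizes.append(size)
    return sizes
```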
# FIXME: hack to reduce ITL when decode bs is small
disaggregation_decode_polling_interval: int = 1

compilation_config: Optional[CompilationConfig] = None
put this under enable_torch_compile
Done, thanks for the comment. The compilation_config option is still here, but validation that enable_torch_compile is set was added (see the sketch below).
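Roughly the kind of guard being described, as a sketch; the actual check lives in the ServerArgs validation in server_args.py and may look different:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Args:
    enable_torch_compile: bool = False
    compilation_config: Optional[str] = None  # JSON string from --compilation-config

    def validate(self) -> None:
        # compilation_config is only meaningful when torch.compile is enabled
        if self.compilation_config is not None and not self.enable_torch_compile:
            raise ValueError("--compilation-config requires --enable-torch-compile")


Args(enable_torch_compile=True,
     compilation_config='{"compiler": "npugraph_ex"}').validate()
```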
Motivation
This Pull Request is part of an architecture update to implement a hardware independence layer, using the Qwen3 model as an example. Other models will be supported later. The implemented changes make the model compilable. The PassManager and fusing passes (for fp16 and quantized models) based on torch.fx.replace_pattern implemented in this Pull Request remove hardware-specific logic from the Qwen3 model and allow reusing the same Python models on different hardware with hardware-specific optimizations (passes: SplitQkvRmsnormRopeFuse, NpuAddRmsNormQuantFuse, NpuAddRmsNormDynamicQuantFuse). The compiler backends implemented in this Pull Request select the optimal inference path for the chosen hardware. Additionally, this Pull Request brings a performance gain for fp16 and quantized models.

Benchmarks (sglang.bench_serving, Ascend 910A3, batch size = 32, 64, 128; GSM8K, Ascend 910A2):
Without --enable-torch-compile:
Latency: 826.420 s
Output throughput: 454.471 token/s
Latency: 536.163 s
Output throughput: 684.102 token/s
Latency: 378.931 s
Output throughput: 944.312 token/s

With --enable-torch-compile:
Latency: 798.009 s
Output throughput: 480.432 token/s
Latency: 528.144 s
Output throughput: 720.576 token/s
Latency: 367.730 s
Output throughput: 969.680 token/s
PassManager for current and future fuses in Python via torch.fx.replace_pattern. Fuses can be easily developed by external contributors.
Fuse of the AddRmsNorm and AscendQuantV2 kernels into the AddRmsNormQuant kernel.
Original comment: [feat] npu support enable_torch_compile #12371
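As an illustration of the mechanism these passes build on (not the actual pass code from this PR), a minimal torch.fx.replace_pattern example that rewrites a multiply-add subgraph into a single fused call; the real passes match the AddRmsNorm + quantize subgraph and substitute the fused NPU kernel instead:

```python
import torch
import torch.fx as fx


def pattern(x, y, z):
    # subgraph to find: elementwise multiply followed by add
    return x * y + z


def replacement(x, y, z):
    # fused op the matched subgraph is rewritten to
    return torch.addcmul(z, x, y)


class TinyModel(torch.nn.Module):
    def forward(self, a, b, c):
        return torch.relu(a * b + c)


gm = fx.symbolic_trace(TinyModel())
matches = fx.replace_pattern(gm, pattern, replacement)  # rewrites gm in place
print(f"{len(matches)} subgraph(s) fused")
print(gm.code)  # the generated forward now calls torch.addcmul
```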
TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph-mode capabilities for torch_npu. It enables users to perform graph-mode inference on NPU using PyTorch and torch_npu. TorchAir offers a torch.compile backend for NPU that interfaces with torch._dynamo. Through the following features, it achieves performance optimization and capability enhancement of the torch fx graph.
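For reference, the usual way TorchAir plugs into torch.compile looks roughly like this (generic torchair usage on an Ascend NPU device, not this PR's integration code):

```python
import torch
import torch_npu   # Ascend adapter for PyTorch
import torchair

# CompilerConfig + get_npu_backend give a torch.compile backend that hands
# the torch._dynamo FX graph over to TorchAir's graph mode.
config = torchair.CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)

model = torch.nn.Linear(16, 16).npu()
compiled = torch.compile(model, backend=npu_backend, dynamic=False)
out = compiled(torch.randn(4, 16).npu())
```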
TorchAir Main Features:
How to enable compilation and fuses for NPUGraph decode (an example launch command is sketched below):
Optional customization:
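A hedged example of what the launch command can look like, built only from the flags discussed in this PR; the model path and batch-size cap are placeholders:

```python
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen3-32B",                      # placeholder model
    "--enable-torch-compile",                               # enable compilation and fuses
    "--torch-compile-max-bs", "128",                        # optional: cap compiled batch size
    "--compilation-config", '{"compiler": "npugraph_ex"}',  # select the NPU graph backend
    "--disable-overlap-schedule",
])
```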
How to enable TorchAir for decode (an example launch command is sketched below):
Optional customization:
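Similarly hedged, a TorchAir launch example using the separate flag discussed above (the model path is a placeholder):

```python
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen3-32B",   # placeholder model
    "--enable-npu-torchair-compile",     # enable the TorchAir compile path for decode
])
```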
CANN version: 8.2
Torch NPU version: torch-npu 2.6.0.post3

NpuAddRmsNormQuantFuse pass
The NpuAddRmsNormQuantFuse pass fuses AddRmsNorm and AscendQuant into AddRmsNormQuant.
Before fuse: the NPU kernels AddRmsNorm and AscendQuant are used, which take 29 microseconds:
[profiling screenshot]
After fuse: the NPU kernel AddRmsNormQuant is used, which takes 19 microseconds:
[profiling screenshot]

SplitQkvRmsnormRopeFuse pass
Before fuse: the NPU kernels RmsNorm, RmsNorm and RopeWithSinCosCache are used, which take 62 microseconds:
[profiling screenshot]
After fuse: the NPU kernel split_kqv_rmsnorm_rope is used, which takes 25 microseconds:
[profiling screenshot]

Modifications
Model compilation support by torch.compile. Use --enable-torch-compile to enable compilation and the optional --torch-compile-max-bs argument to limit the max batch size for compilation.
NpuGraphCompilerBackend compilation backend for NPU Graph capturing. Implemented in python/sglang/srt/model_executor/compilation/npu_graph_compiler_backend.py.
PassManager passes manager and passes (python/sglang/srt/model_executor/compilation/passes/w8a8_int8 and python/sglang/srt/compilation/npu/passes/fp16.py) to optimize the model during compilation.
RotaryEmbedding layer uses an NPU kernel in forward instead of the native implementation.
python/sglang/srt/layers/attention/ascend_backend.py:
7.1. Rewrite the capture function;
7.2. Encapsulate the kvcache input (the input needs the full kvcache);
7.3. Pad the block table to the max length (see the sketch after this list);
7.4. TorchAir input preparation;
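A hedged illustration of step 7.3: graph capture/replay needs fixed shapes, so a variable-length block table is padded up to a fixed maximum. Function and argument names here are assumptions, not the PR's actual code:

```python
import torch
import torch.nn.functional as F


def pad_block_table(block_table: torch.Tensor, max_blocks: int, pad_value: int = 0) -> torch.Tensor:
    # block_table: [batch, blocks_per_seq]; pad the last dim up to max_blocks
    pad = max_blocks - block_table.shape[-1]
    return F.pad(block_table, (0, pad), value=pad_value)


bt = torch.tensor([[3, 7, 11]])
print(pad_block_table(bt, max_blocks=8))  # tensor([[ 3,  7, 11,  0,  0,  0,  0,  0]])
```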
The calling process is as follows.

Accuracy Tests
Collected on the gsm8k dataset for static quantized Qwen3-32B:

TorchAir
Collected on the MMMU dataset for Qwen3-VL-30B-A3B-Instruct:

Benchmarking and Profiling (910A3)
Reference
Compilation
Future roadmaps
In the torch_npu 7.2.0 version, the reduce-overhead mode of the torchair backend will support torch.compile(model, dynamic=True). This mode will be set as the default in get_compile_backend(), enabling support for methods wrapped by the @torch.compile() decorator.
In the torch_npu 7.3.0 version, the capture and replay of NPUGraph currently integrated in the torchair backend will be changed to optional execution. The torchair backend will only perform optimizations such as fx pass optimization and static kernel compilation, while the capture and replay of NPUGraph will be implemented independently. This design is closer to the implementation of CudaGraphRunner, decoupling fx graph optimization from graph offloading.

Checklist