[NPU] Piecewise Graph for decode with PassManager & fuses #15332

eshoguli wants to merge 76 commits into sgl-project:main
Conversation
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
…r & ChannelQuantScaleParameter support
Summary of Changes

Hello @eshoguli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances Ascend NPU inference performance by introducing a piecewise graph compilation framework. It allows the computational graph to be decomposed into smaller, optimizable segments, which are then processed by a new compilation pipeline.
Code Review
This pull request introduces significant new functionality for NPU graph compilation, including a piecewise graph runner and a PassManager for optimizations. The changes are extensive and touch many parts of the compilation and runtime stack. While the overall direction is good, there are several areas that need improvement in terms of correctness, maintainability, and clarity. Specifically, there are risky monkey-patching practices, obscure hacks for graph compilation, and hardcoded values that make the code brittle. Addressing these issues will be crucial for the long-term stability and maintainability of this new NPU backend.
```python
torch.cuda.CUDAGraph = torch.npu.NPUGraph
torch.cuda.synchronize = torch.npu.synchronize
torch.cuda.graph = torch.npu.graph
torch.cuda.stream = torch.npu.stream
torch.cuda.Stream = torch.npu.Stream
torch.cuda.current_stream = torch.npu.current_stream
torch.cuda.graph_pool_handle = torch.npu.graph_pool_handle
```
Monkey-patching the torch.cuda namespace to alias torch.npu is highly risky and can lead to unexpected behavior in other parts of the codebase that rely on torch.cuda for actual CUDA operations. This global side effect makes the code harder to reason about and maintain. Please consider a more localized approach, such as creating a compatibility module or using conditional imports where NPU-specific functions are needed, rather than patching the entire namespace.
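One possible shape for the suggested compatibility module, as a minimal sketch (`resolve_module` and the commented call sites are illustrative, not existing sglang API):

```python
# Hypothetical sketch: resolve the device runtime once, instead of
# monkey-patching the torch.cuda namespace globally.
import importlib


def resolve_module(candidates):
    """Return the first importable module from `candidates`."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise RuntimeError(f"none of {candidates} is importable")


# NPU-aware call sites would then do something like:
#   device_rt = resolve_module(["torch.npu", "torch.cuda"])
#   device_rt.synchronize()
```

This keeps the NPU fallback local to the modules that need it, so code that genuinely depends on `torch.cuda` is unaffected.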
```python
if (
    self.enable_piecewise_npu_graph_decode
    and torch.compiler.is_dynamo_compiling()
):
    # input args for submodule forward
    forward_batch.req_to_token_pool.req_to_token.add_(
        forward_batch.req_to_token_pool.req_to_token
    )
    forward_batch.req_pool_indices.add_(forward_batch.req_pool_indices)
    forward_batch.seq_lens.add_(forward_batch.seq_lens)
```
The use of in-place add_ operations on tensors to mark them as inputs for torch.compile is an obscure hack. This code doubles the tensor values, which could lead to correctness issues if not handled carefully downstream. While this might be a necessary workaround for Dynamo, it severely impacts code readability and maintainability. Please add a detailed comment explaining why this is necessary and what it achieves. A better long-term solution would be to find a more explicit way to register tensor dependencies for graph compilation.
```python
if not self.compilation_context.stream:
    self.compilation_context.stream = torch_npu.npu.Stream()

torch.cuda.synchronize()
```
This file is in an NPU-specific path, but it calls torch.cuda.synchronize(). This should be torch.npu.synchronize() to ensure correct synchronization on NPU devices. Using the CUDA version here could lead to incorrect behavior or errors when running on NPUs.
Suggested change:

```diff
- torch.cuda.synchronize()
+ torch.npu.synchronize()
```
```python
flatten = positions.flatten()
cos_sin = cos_sin_cache.index_select(0, flatten)

reshape = cos_sin.reshape(-1, 2, 64)
```
The value 64 is hardcoded in reshape. This appears to be related to the head dimension. Hardcoding this value makes the pass brittle and will likely cause it to fail for models with different head dimensions. This should be derived from a configuration parameter, such as self.head_dim // 2, to ensure the pass is general and robust.
Suggested change:

```diff
- reshape = cos_sin.reshape(-1, 2, 64)
+ reshape = cos_sin.reshape(-1, 2, self.head_dim // 2)
```
```python
def __init__(
    self,
    capture_sizes: List[int] = [],
    compiler: str = "eager",
    enable_debug_mode: bool = False,
    splitting_ops: List[str] = [],
):
    self.traced_files = set()
    self.capture_sizes = capture_sizes
    self.compiler = compiler
    self.enable_debug_mode = enable_debug_mode
    self.splitting_ops = splitting_ops
```
Using mutable default arguments like [] is a common pitfall in Python. The same list instance will be shared across all CompilationConfig objects created without explicitly providing capture_sizes or splitting_ops. If one instance modifies its list, it will affect all other instances. To avoid this potential bug, you should use None as the default value and initialize a new list inside __init__ if the argument is None. You will also need to import Optional from typing.
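The pitfall is easy to reproduce in isolation (a toy class, not the real `CompilationConfig`):

```python
# Toy reproduction of the shared-mutable-default bug described above.
class Config:
    def __init__(self, capture_sizes=[]):  # the same list object is reused on every call
        self.capture_sizes = capture_sizes


a = Config()
b = Config()
a.capture_sizes.append(32)
print(b.capture_sizes)  # [32]: b observes a's mutation
```

Both instances hold the very same list object, which is why the `None`-default idiom in the suggestion below is the standard fix.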
Suggested change:

```python
def __init__(
    self,
    capture_sizes: Optional[List[int]] = None,
    compiler: str = "eager",
    enable_debug_mode: bool = False,
    splitting_ops: Optional[List[str]] = None,
):
    self.traced_files = set()
    self.capture_sizes = capture_sizes if capture_sizes is not None else []
    self.compiler = compiler
    self.enable_debug_mode = enable_debug_mode
    self.splitting_ops = splitting_ops if splitting_ops is not None else []
```
```python
q_weight: torch.Tensor,
k_weight: torch.Tensor,
q_hidden_size: int,
kv_hiddem_size: int,
```
There appears to be a typo in the parameter name: `kv_hiddem_size` should likely be `kv_hidden_size`.
```python
class Submodule(torch.nn.Module):
    block_tables = None
```
The block_tables attribute is defined as a class variable and is modified in forward_with_calculation. This stateful design, where one method call (forward_with_calculation) sets a class-level state that a subsequent call (forward) depends on, is fragile and can lead to subtle bugs, especially in concurrent environments. It would be cleaner and safer to pass block_tables as an argument to the forward method or manage this state within an instance rather than at the class level.
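The hazard can be shown without torch (a toy class mirroring the pattern; method names follow the review comment above):

```python
# Toy illustration: class-level state set by one call path leaks into
# every other instance of the class.
class Submodule:
    block_tables = None  # class attribute, shared by all instances

    def forward_with_calculation(self, tables):
        Submodule.block_tables = tables  # mutates class-level state

    def forward(self):
        return self.block_tables  # silently reads whatever was set last


a, b = Submodule(), Submodule()
a.forward_with_calculation([7, 8])
print(b.forward())  # [7, 8], although b never set block_tables itself
```

Passing `block_tables` as a `forward` argument, or assigning `self.block_tables` in `__init__`, confines the state to a single instance.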
```python
ops_count = 3
ops_step = ops_count + 1
```
The magic numbers ops_count = 3 and ops_step = 4 make this graph splitting logic hard to understand and maintain. It seems to be looking for a specific pattern of nodes. Please add comments to explain what this pattern is and why these specific numbers are used. Consider defining them as named constants with descriptive names if they represent a fixed pattern.
`PassManager` is implemented here: [PassManager](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/pass_manager.py)

You can explore `PassManager` usage in [`NpuGraphCompilerBackend`](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/npu_graph_compiler_backend.py) compiler backend. [`PiecewiseNpuGraphCompilerBackend`](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/piecewise_npu_graph_compiler_backend.py) compiler backed uses `PassManager` too via `NpuGraphCompilerBackend` inheritance.
There is a typo here. "compiler backed" should be "compiler backend".
Suggested change:

```diff
- ...compiler backed uses `PassManager` too via `NpuGraphCompilerBackend` inheritance.
+ ...compiler backend uses `PassManager` too via `NpuGraphCompilerBackend` inheritance.
```
After developing a pass, create a `PassManager` instance, add the pass, and call its `apply` method:

```python
def apply_passes(self, graph_module: torch.fx.GraphModule):
    passManager = PassManager(graph_module)
    passManager.add(YourFusePass())  # hypothetical pass name: add the developed pass
    passManager.apply()
```
Can this PR work on A2 NPUs? If it can work efficiently, please give examples for some models, e.g. Qwen-32B or Qwen3-Next-80B. Thank you.
Motivation
Piecewise for NPU Graph, based on: #11104
This Pull Request is part of an architecture update to implement a hardware-independence layer, using the `Qwen3` model as an example. Other models will be supported later. The implemented changes make the model compilable.

Implemented in this Pull Request: `PassManager` and fusing passes (for `fp16` and quantized models) based on `torch.fx.replace_pattern` remove hardware-specific logic from the `Qwen3` model and allow the same Python models to be reused on different hardware with hardware-specific optimizations (passes: `SplitQkvRmsnormRopeFuse`, `NpuAddRmsNormQuantFuse`, `NpuAddRmsNormDynamicQuantFuse`). The compiler backends implemented in this Pull Request select the optimal inference path for the chosen hardware. Additionally, this Pull Request brings a performance gain for `fp16` and quantized models.

`sglang.bench_serving`: Ascend 910A3, batch size = 32, 64, 128, with `--enable-torch-compile`.

GSM8K, Ascend 910A2:

Without `--enable-torch-compile`:
- Latency: 826.420 s, Output throughput: 454.471 token/s
- Latency: 536.163 s, Output throughput: 684.102 token/s
- Latency: 378.931 s, Output throughput: 944.312 token/s

With `--enable-torch-compile`:
- Latency: 798.009 s, Output throughput: 480.432 token/s
- Latency: 528.144 s, Output throughput: 720.576 token/s
- Latency: 367.730 s, Output throughput: 969.680 token/s
`PassManager` for current and future fuses in Python via `torch.fx.replace_pattern`. Fuses can be easily developed by external contributors, e.g. fusing the `AddRmsNorm` and `AscendQuantV2` kernels into the `AddRmsNormQuant` kernel.

Original comment: [feat] npu support enable_torch_compile #12371
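As a minimal, self-contained illustration of how a fuse built on `torch.fx.replace_pattern` works (generic CPU ops stand in for the NPU kernels that the real passes match):

```python
import torch
import torch.fx


class TinyModel(torch.nn.Module):
    def forward(self, x, y):
        return torch.relu(torch.add(x, y))


def pattern(x, y):
    # subgraph to match: add followed by relu
    return torch.relu(torch.add(x, y))


def replacement(x, y):
    # stands in for a single fused kernel with identical semantics
    return torch.clamp(torch.add(x, y), min=0)


gm = torch.fx.symbolic_trace(TinyModel())
torch.fx.replace_pattern(gm, pattern, replacement)  # rewrite matched subgraphs in place
```

A real pass would match the NPU kernel calls (e.g. `AddRmsNorm` + `AscendQuantV2`) and substitute the fused `AddRmsNormQuant` call instead.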
TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph-mode capabilities for torch_npu. It enables users to perform graph-mode inference on NPU using PyTorch and torch_npu. TorchAir offers a `torch.compile` backend for NPU, which interfaces with `torch._dynamo`. The following features enable performance optimization and capability enhancement of the torch fx graph.
TorchAir Main Features:
How to enable compilation and fuses for `NPUGraph` decode:

- Variant 1: similar to the other options.
- Variant 2: more general, with customization in the future.

How to enable piecewise graph and fuses for decode:

- Variant 1: similar to the other options, with default compilation parameters.
- Variant 2: more general, with customization (the `splitting_ops` key is optional).

How to enable TorchAir for decode:

- Variant 1: similar to the other options.
- Variant 2: more general, with customization in the future.
CANN version: 8.2
Torch NPU version:
`torch-npu` 2.6.0.post3

`NpuAddRmsNormQuantFuse` pass

The `NpuAddRmsNormQuantFuse` pass fuses `AddRmsNorm` and `AscendQuant` into `AddRmsNormQuant`.

Before fuse: NPU kernels

`AddRmsNorm` and `AscendQuant` usage, which takes 29 microseconds.

After fuse: NPU kernel

`AddRmsNormQuant` usage, which takes 19 microseconds.

`SplitQkvRmsnormRopeFuse` pass

Before fuse: NPU kernels

`RmsNorm`, `RmsNorm` and `RopeWithSinCosCache` usage, which together take 62 microseconds.

After fuse: NPU kernel

`split_kqv_rmsnorm_rope` usage, which takes 25 microseconds.

Modifications
Model compilation support via `torch.compile`:

1. Use `--enable-torch-compile` to enable compilation and the optional `--torch-compile-max-bs` argument to limit the max batch size for compilation.
2. `NpuGraphCompilerBackend` compilation backend for NPU Graph capturing. Implemented in `python/sglang/srt/model_executor/compilation/npu_graph_compiler_backend.py`.
3. `PiecewiseNpuGraphCompilerBackend` compilation backend for piecewise graph and partial NPU Graph capturing. Inherited from `NpuGraphCompilerBackend` to reuse fusing passes. Implemented in `python/sglang/srt/model_executor/compilation/piecewise_npu_graph_compiler_backend.py`.
4. Use `--enable-piecewise-npu-graph-decode` to enable the piecewise graph. Optional command line arguments: `--compilation-config {"splitting_ops": ["atb._npu_paged_attention"]}` to configure the compilation backend, `--cuda-graph-bs` to specify batch size, `--cuda-graph-max-bs` to limit max batch size.
5. `PassManager` passes manager and passes in `python/sglang/srt/model_executor/compilation/passes/w8a8_int8` and `python/sglang/srt/compilation/npu/passes/fp16.py` to optimize the model during compilation.
6. `RotaryEmbedding` layer uses an NPU kernel in `forward` instead of the native implementation.
7. `python/sglang/srt/layers/attention/ascend_backend.py`:
   7.1. Rewrite the capture function;
   7.2. Encapsulate the kvcache input (the input needs all of the kvcache);
   7.3. Pad the block table to the max length;
   7.4. TorchAir input preparation;
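Putting the flags above together, a launch command could look like the following sketch (the model path and batch-size value are placeholders, not taken from this PR):

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --enable-piecewise-npu-graph-decode \
  --compilation-config '{"splitting_ops": ["atb._npu_paged_attention"]}' \
  --cuda-graph-max-bs 128
```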
The calling process is as follows.

Class Diagram

```mermaid
classDiagram
    class PiecewiseNpuGraphRunnerDecode
    class NPUCompileModelRunner
    class NPUGraphRunner
    class CudaGraphRunner
    class NpuGraphCompiler
    class NpuGraphCompilerBackend
    class PiecewiseNpuGraphCompiler
    class PiecewiseNpuGraphCompilerBackend
    NPUGraphRunner --|> CudaGraphRunner
    NPUGraphRunner --> NpuGraphCompiler
    NpuGraphCompiler --> NpuGraphCompilerBackend
    NPUCompileModelRunner --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> PiecewiseNpuGraphCompiler
    PiecewiseNpuGraphCompiler --> PiecewiseNpuGraphCompilerBackend
    PiecewiseNpuGraphCompilerBackend --|> NpuGraphCompilerBackend
```

Accuracy Tests
Collected on the gsm8k dataset for statically quantized `Qwen3-32B`.

TorchAir

Collected on the MMMU dataset for `Qwen3-VL-30B-A3B-Instruct`.

Benchmarking and Profiling (910A3)
Reference
Compilation
Future roadmaps
In the `torch_npu` 7.2.0 version, the reduce-overhead mode of the torchair backend will support `torch.compile(model, dynamic=True)`. This mode will be set as the default in `get_compile_backend()`, enabling support for methods wrapped by the `@torch.compile()` decorator.

In the `torch_npu` 7.3.0 version, the capture and replay of `NPUGraph`, currently integrated in the torchair backend, will become optional. The torchair backend will only perform optimizations such as fx pass optimization and static kernel compilation, while the capture and replay of `NPUGraph` will be implemented independently. This design is closer to the implementation of `CudaGraphRunner`, decoupling fx graph optimization from graph offloading.

Checklist