[feat] npu support enable_torch_compile#12371

Closed
XDaoHong wants to merge 1 commit into sgl-project:main from XDaoHong:main

@XDaoHong
Contributor

@XDaoHong XDaoHong commented Oct 30, 2025

Motivation

TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph-mode capabilities for torch_npu. It lets users run graph-mode inference on NPU with PyTorch and torch_npu. TorchAir exposes a torch.compile backend for NPU that interfaces with torch._dynamo. The features listed below optimize the performance of the torch FX graph and extend its capabilities.

Main Features:

  1. Basic Features:
  • Enable NPU kernels that depend on host-value tiling operators (e.g., FIA) to support npugraph
  • Graph input copy optimization
  • Memory reuse across multi-graphs
  2. FX Pass:
  • In-place optimization
  • Redundant operator elimination
  • NPU fused operator passes
  3. Advanced Features:
  • Static shape kernel compilation
  • Multi-stream within single graphs
  • Compilation caching

Modifications

  1. Rewrite the capture function.
  2. Encapsulate the KV cache input (the input needs the entire KV cache).
  3. Pad the block table to the max length.
  4. Prepare the TorchAir inputs.
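
Step 3 can be illustrated with a minimal sketch. It is plain Python rather than the PR's actual tensor code, and the function name `pad_block_table`, the `max_blocks` parameter, and the pad value of 0 are all assumptions made for illustration. The idea is that every request's row of block IDs is extended to the same length, so the table always has the same width.

```python
from typing import List


def pad_block_table(
    block_table: List[List[int]], max_blocks: int, pad_value: int = 0
) -> List[List[int]]:
    """Pad every row of block IDs to exactly `max_blocks` entries.

    Hypothetical helper: the name, signature, and pad value are
    illustrative and are not taken from the PR.
    """
    padded = []
    for row in block_table:
        if len(row) > max_blocks:
            raise ValueError(
                f"row has {len(row)} blocks, more than max_blocks={max_blocks}"
            )
        # Append pad_value until the row reaches the fixed width.
        padded.append(row + [pad_value] * (max_blocks - len(row)))
    return padded


# Three requests that currently hold 2, 1, and 3 KV cache blocks.
table = [[3, 7], [1], [4, 5, 6]]
padded = pad_block_table(table, max_blocks=4)
```

After padding, each row has four entries regardless of how many blocks the request actually uses, which is the fixed-width property the step asks for.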

The calling process is as follows.
[Figure: TorchAir call flow]

Accuracy Tests

python3 few_shot_gsm8k.py --data-path "/path/to/model/test.jsonl.txt" --parallel 32 --num-questions 200

Accuracy: 0.865
Invalid: 0.000
Latency: 43.077 s
Output throughput: 795.877 token/s
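
As a rough sanity check on these numbers, the sketch below back-computes the total number of generated tokens. It assumes that the reported output throughput is total output tokens divided by the reported latency; that relationship is an assumption, not something stated in this PR.

```python
# Values copied from the benchmark output above.
latency_s = 43.077
output_throughput_tok_per_s = 795.877
num_questions = 200

# Assumption: throughput = total output tokens / latency.
total_tokens = output_throughput_tok_per_s * latency_s
tokens_per_question = total_tokens / num_questions

print(f"approx. total output tokens: {total_tokens:.0f}")
print(f"approx. output tokens per question: {tokens_per_question:.1f}")
```

Under that assumption the run generated roughly 34,000 output tokens, or about 170 per question.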

Benchmarking and Profiling

Future roadmaps

In torch_npu 7.2.0, the reduce-overhead mode of the torchair backend will support torch.compile(model, dynamic=True). This mode will then become the default in get_compile_backend(), which enables support for methods wrapped by the @torch.compile() decorator.
In torch_npu 7.3.0, the NPUGraph capture and replay currently integrated in the torchair backend will become optional. The torchair backend will only perform optimizations such as FX passes and static kernel compilation, while NPUGraph capture and replay will be implemented independently. This design is closer to the implementation of CudaGraphRunner and decouples FX graph optimization from graph offloading.

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ping1jing2 ping1jing2 marked this pull request as draft October 30, 2025 09:39
@ssshinigami
Contributor

Could you please add a description and motivation, as well as accuracy and performance measurements?

@XDaoHong XDaoHong force-pushed the main branch 4 times, most recently from 0069c10 to cd25a86 Compare November 3, 2025 02:06
@iforgetmyname iforgetmyname marked this pull request as ready for review November 5, 2025 03:12
@iforgetmyname iforgetmyname marked this pull request as draft November 5, 2025 06:47
for k, v in predefined_config.items():
    setattr(compiler_config.experimental_config, k, v)

compiler_config.mode = "max-autotune" if mode is None else mode
Collaborator

please add a comment "TODO(iforgetmyname): Change this default value once CANN version 8.3.RC1" to help me remember to change this default value to reduce-overhead

Contributor Author

done
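
The pattern discussed in this thread can be sketched in isolation as follows. This is a stand-in built on `types.SimpleNamespace` rather than the real TorchAir compiler config object, and the function name `build_compiler_config` and the `example_flag` key are hypothetical. It only shows how the predefined options are copied onto the experimental config and where the default mode, with the requested TODO, is applied.

```python
from types import SimpleNamespace
from typing import Optional


def build_compiler_config(
    predefined_config: dict, mode: Optional[str] = None
) -> SimpleNamespace:
    # Stand-in for the real compiler config object (hypothetical structure).
    compiler_config = SimpleNamespace(
        experimental_config=SimpleNamespace(), mode=None
    )

    # Copy every predefined option onto the experimental config.
    for k, v in predefined_config.items():
        setattr(compiler_config.experimental_config, k, v)

    # TODO(iforgetmyname): Change this default value once CANN version 8.3.RC1
    compiler_config.mode = "max-autotune" if mode is None else mode
    return compiler_config


# No mode given: the current default applies.
default_cfg = build_compiler_config({"example_flag": True})
# Explicit mode: the caller's choice wins over the default.
override_cfg = build_compiler_config({}, mode="reduce-overhead")
```

Keeping the default in a single expression means the later switch to reduce-overhead, as requested by the reviewer, would be a one-line change.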

@XDaoHong XDaoHong force-pushed the main branch 2 times, most recently from e14ec76 to 9f46b8a Compare November 7, 2025 04:07
not get_moe_a2a_backend().is_none()
or should_use_flashinfer_cutlass_moe_fp4_allgather()
(
not get_moe_a2a_backend().is_none()
Collaborator

revert this part

Contributor

@eshoguli eshoguli left a comment

A general comment about the NPUGraphRunner updates. NPUGraphRunner inherits from CudaGraphRunner and reuses its capture logic (ForwardBatch instantiation, initialization, and capturing) and part of its replay pipeline. It uses the NPUGraph Python type.

Your changes don't use any of that. You have a custom capture implementation and don't use the NPUGraph Python type for inference. As a result, you should implement a separate runner: TorchAirRunner or something like that.

Could you please explain, offline or online, why you need to use NPUGraphRunner? Thanks!

…E_FIA

Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
6 participants