[feat] npu support enable_torch_compile #12371
XDaoHong wants to merge 1 commit into sgl-project:main from
Conversation
Could you please add a description and motivation?
Force-pushed from 0069c10 to cd25a86
python/sglang/srt/utils/common.py (Outdated)
```python
for k, v in predefined_config.items():
    setattr(compiler_config.experimental_config, k, v)

compiler_config.mode = "max-autotune" if mode is None else mode
```
Please add a comment "TODO(iforgetmyname): Change this default value once CANN version 8.3.RC1" to help me remember to change this default value to `reduce-overhead`.
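A runnable sketch of the pattern in the diff above. The names `predefined_config` and `compiler_config` follow the diff; the `SimpleNamespace` scaffolding and the example option keys are hypothetical stand-ins for torchair's real `CompilerConfig` object:

```python
from types import SimpleNamespace

# Stand-in for torchair.CompilerConfig(); the real object carries many
# more fields than the two used here.
compiler_config = SimpleNamespace(experimental_config=SimpleNamespace(), mode=None)

# Predefined experimental options are copied onto the config via setattr,
# exactly as in the diff above. The keys shown are illustrative.
predefined_config = {"frozen_parameter": True, "tiling_schedule_optimize": True}
for k, v in predefined_config.items():
    setattr(compiler_config.experimental_config, k, v)

# "max-autotune" stays the default until the reviewer's TODO above is resolved.
mode = None
compiler_config.mode = "max-autotune" if mode is None else mode
print(compiler_config.mode)  # max-autotune
```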
Force-pushed from e14ec76 to 9f46b8a
```python
not get_moe_a2a_backend().is_none()
or should_use_flashinfer_cutlass_moe_fp4_allgather()
(
    not get_moe_a2a_backend().is_none()
```
General comment about the NPUGraphRunner updates: NPUGraphRunner inherits from CudaGraphRunner and reuses its capture pipeline (ForwardBatch instantiation, initialization, and capture) and part of its replay pipeline; the NPUGraph Python type is used.
Your changes don't use any of that. You have a custom capture implementation and don't use the NPUGraph Python type for inference. As a result, you should implement a separate runner: TorchAirRunner or something like that.
Can you please explain, offline or online, why you need to use NPUGraphRunner? Thanks!
…E_FIA Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
Motivation
TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph-mode capabilities for torch_npu. It enables users to perform graph-mode inference on NPU using PyTorch and torch_npu. TorchAir externally offers a `torch.compile` backend for NPU, which interfaces with `torch._dynamo`. Through the following features, it achieves performance optimization and capability enhancement of the torch fx graph.

Main Features:
Modifications
The calling process is as follows.

Accuracy Tests
```shell
python3 few_shot_gsm8k.py --data-path "/path/to/model/test.jsonl.txt" --parallel 32 --num-questions 200
```
Benchmarking and Profiling
Future roadmap
In torch_npu 7.2.0, the reduce-overhead mode of the torchair backend will support `torch.compile(model, dynamic=True)`. This mode will then be set as the default in `get_compile_backend()`, enabling support for methods wrapped by the `@torch.compile()` decorator.
In torch_npu 7.3.0, the NPUGraph capture and replay currently integrated in the torchair backend will become optional. The torchair backend will then only perform optimizations such as fx pass optimization and static kernel compilation, while NPUGraph capture and replay will be implemented independently. This design is closer to the CudaGraphRunner implementation, decoupling fx graph optimization from graph offloading.
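The decoupling described above can be sketched as follows. All class and function names here (`GraphRunner`, `optimize`) are hypothetical illustrations of the split between fx-level optimization and graph capture/replay, not the actual SGLang runner code:

```python
class GraphRunner:
    """Owns capture and replay only; knows nothing about fx passes."""

    def __init__(self, compiled_fn):
        self.compiled_fn = compiled_fn
        self.captured = {}

    def capture(self, batch_size):
        # A real runner would record an NPUGraph per batch size here;
        # we just cache the callable to keep the sketch runnable.
        self.captured[batch_size] = self.compiled_fn

    def replay(self, batch_size, *args):
        return self.captured[batch_size](*args)

def optimize(fn):
    """Stand-in for the torchair backend: fx pass optimization and static
    kernel compilation, with no graph capture involved."""
    return fn  # identity for the sketch

runner = GraphRunner(optimize(lambda x: x * 2))
runner.capture(1)
print(runner.replay(1, 21))  # 42
```

The point of the split is that `optimize` can be swapped or disabled without touching capture/replay, mirroring how CudaGraphRunner separates the two concerns.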
Checklist