# [NPU] Piecewise Graph for decode with PassManager & fuses #15332
@@ -0,0 +1,30 @@

## How to transform model instances with the PyTorch FX toolkit in SGLang for NPU

### PassManager
`PassManager` is implemented here: [PassManager](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/pass_manager.py).

You can explore `PassManager` usage in the [`NpuGraphCompilerBackend`](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/npu_graph_compiler_backend.py) compiler backend. The [`PiecewiseNpuGraphCompilerBackend`](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/piecewise_npu_graph_compiler_backend.py) compiler backend also uses `PassManager`, through inheritance from `NpuGraphCompilerBackend`.
### Pass development
There are two approaches to developing passes for the SGLang NPU `PassManager`:

1. Subgraph rewriting: match all possible non-overlapping sets of operators and their data dependencies with the `torch.fx.replace_pattern` API.
   Pass example: [NpuAddRmsNormQuantFuse](https://github.com/eshoguli/sglang/blob/3365d711fd5aa0d6191c32769163320fe41e27f2/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/passes/w8a8_int8.py#L82).
   Details are in the official FX toolkit documentation: https://docs.pytorch.org/docs/stable/fx.html#subgraph-rewriting-with-replace-pattern

2. Direct graph manipulation: edit the nodes of `graph_module.graph` by hand.
   Pass example: [EraseCopy](https://github.com/eshoguli/sglang/blob/3365d711fd5aa0d6191c32769163320fe41e27f2/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/passes/w8a8_int8.py#L28).
   Details are in the official FX toolkit documentation: https://docs.pytorch.org/docs/stable/fx.html#direct-graph-manipulation
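Both approaches can be tried on a toy module using only public `torch.fx` APIs; the fused NPU passes from this PR are not involved, and the pattern/replacement pair below is invented purely for illustration:

```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    def forward(self, x, y):
        return torch.relu(torch.add(x, y))


gm = torch.fx.symbolic_trace(Toy())


# Approach 1: subgraph rewriting — swap every add(a, b) for sub(a, b).
def pattern(a, b):
    return torch.add(a, b)


def replacement(a, b):
    return torch.sub(a, b)


torch.fx.replace_pattern(gm, pattern, replacement)

# Approach 2: direct graph manipulation — erase the relu node by hand.
for node in list(gm.graph.nodes):
    if node.op == "call_function" and node.target is torch.relu:
        node.replace_all_uses_with(node.args[0])
        gm.graph.erase_node(node)
gm.graph.lint()
gm.recompile()

out = gm(torch.tensor([1.0]), torch.tensor([4.0]))  # now computes 1 - 4
```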
### Compiler backend update
After developing a pass, create a `PassManager` instance, add the pass, and call its `apply` method:
```python
def apply_passes(self, graph_module: torch.fx.GraphModule):
    pass_manager = PassManager(graph_module)
    pass_manager.add(NpuAddRmsNormQuantFuse)
    pass_manager.apply()
    graph_module.recompile()
```

You can explore [`NpuGraphCompilerBackend`](https://github.com/eshoguli/sglang/blob/eshogulin/pass_manager/python/sglang/srt/hardware_backend/npu/graph_runner/compilation/npu_graph_compiler_backend.py) as an example.
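The real `PassManager` lives in `pass_manager.py` (linked above); the usage snippet only shows that `add` takes a pass class and `apply` runs the passes. A minimal pure-Python sketch of that shape — the constructor/`add`/`apply` signatures and the toy pass are assumptions, not the real API:

```python
class PassManager:
    """Sketch: collect pass classes, then apply each one to the graph module."""

    def __init__(self, graph_module):
        self.graph_module = graph_module
        self.passes = []

    def add(self, pass_cls):
        # Mirrors `passManager.add(NpuAddRmsNormQuantFuse)`: a class, not an instance.
        self.passes.append(pass_cls)

    def apply(self):
        for pass_cls in self.passes:
            pass_cls()(self.graph_module)  # instantiate, then run on the graph


class UppercasePass:  # toy stand-in for a real FX pass
    def __call__(self, gm):
        gm.append(gm.pop().upper())


gm = ["node"]  # toy stand-in for a GraphModule
pm = PassManager(gm)
pm.add(UppercasePass)
pm.apply()
```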
@@ -1,20 +1,23 @@

```python
# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/compilation_config.py

import json
from typing import List


# TODO(Yuwei): support better compile config support
class CompilationConfig:
    def __init__(
        self,
        capture_sizes: List[int] = [],
        compiler: str = "eager",
        enable_debug_mode: bool = False,
        splitting_ops: List[str] = [],
    ):
        self.traced_files = set()
        self.capture_sizes = capture_sizes
        self.compiler = compiler
        self.enable_debug_mode = enable_debug_mode
        self.splitting_ops = splitting_ops
```
Comment on lines 9 to +20:

> **Contributor:** Using mutable default arguments like `capture_sizes: List[int] = []` and `splitting_ops: List[str] = []` means a single list object is shared by every call that relies on the default; the safer idiom is `Optional[List[...]] = None` with the list created inside `__init__`.
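The pitfall the reviewer flags is easy to demonstrate in isolation; the function names below are illustrative, not from the PR:

```python
from typing import List, Optional


def add_size_bad(size: int, sizes: List[int] = []) -> List[int]:
    sizes.append(size)  # mutates the one default list shared by all calls
    return sizes


def add_size_good(size: int, sizes: Optional[List[int]] = None) -> List[int]:
    if sizes is None:  # a fresh list is created on every call
        sizes = []
    sizes.append(size)
    return sizes


a = add_size_bad(1)
b = add_size_bad(2)   # a and b are the SAME list, now [1, 2]
c = add_size_good(1)  # [1]
d = add_size_good(2)  # [2], independent of c
```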
```python
    def add_traced_file(self, file_path: str):
        self.traced_files.add(file_path)
```

@@ -25,5 +28,13 @@ def get_traced_files(self):

```python
    def get_capture_sizes(self):
        return self.capture_sizes

    @classmethod
    def from_cli(cls, args) -> "CompilationConfig":
        args_dict = json.loads(args)
        return CompilationConfig(**args_dict)

    def get_enable_debug_mode(self):
        return self.enable_debug_mode

    def get_splitting_ops(self):
        return self.splitting_ops
```
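`from_cli` expects the compilation config as a JSON string on the command line. A self-contained sketch of that round trip, using a trimmed copy of the class rather than an import from sglang (mutable defaults kept to mirror the PR; see the review comment above):

```python
import json
from typing import List


class CompilationConfig:  # trimmed copy for illustration only
    def __init__(self, capture_sizes: List[int] = [], compiler: str = "eager",
                 enable_debug_mode: bool = False, splitting_ops: List[str] = []):
        self.capture_sizes = capture_sizes
        self.compiler = compiler
        self.enable_debug_mode = enable_debug_mode
        self.splitting_ops = splitting_ops

    @classmethod
    def from_cli(cls, args: str) -> "CompilationConfig":
        # The CLI passes a JSON object whose keys match __init__'s parameters.
        return cls(**json.loads(args))


cfg = CompilationConfig.from_cli('{"capture_sizes": [1, 8, 16], "compiler": "eager"}')
```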
@@ -0,0 +1,52 @@

```python
# Copyright 2025 SGLang Team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

from typing import List, Optional

import torch

import sglang.srt.layers.dp_attention


@torch.library.custom_op("sglang::_set_dp_buffer_len", mutates_args=())
def _set_dp_buffer_len(
    global_dp_buffer_len: Optional[int],
    num_tokens: Optional[int],
    is_max_len: bool,
    global_num_tokens: Optional[List[int]] = None,
) -> None:
    global set_dp_buffer_len_original
    sglang.srt.layers.dp_attention.set_dp_buffer_len(
        global_dp_buffer_len, num_tokens, is_max_len, global_num_tokens
    )


@_set_dp_buffer_len.register_fake
def _set_dp_buffer_len_fake(
    global_dp_buffer_len: Optional[int],
    num_tokens: Optional[int],
    is_max_len: bool,
    global_num_tokens: Optional[List[int]] = None,
) -> None:
    pass


@torch.library.custom_op("sglang::_set_is_extend_in_batch", mutates_args=())
def _set_is_extend_in_batch(is_extend_in_batch: bool) -> None:
    sglang.srt.layers.dp_attention.set_is_extend_in_batch(is_extend_in_batch)


@_set_is_extend_in_batch.register_fake
def _set_is_extend_in_batch_fake(is_extend_in_batch: bool) -> None:
    pass
```
@@ -231,6 +231,9 @@ class AscendAttnBackend(AttentionBackend):

```diff
     def __init__(self, model_runner: ModelRunner):
         super().__init__()
+        self.enable_piecewise_npu_graph_decode = (
+            model_runner.server_args.enable_piecewise_npu_graph_decode
+        )
         self.forward_metadata = None
         self.device = model_runner.device
         self.page_size = model_runner.page_size
```

@@ -248,7 +251,6 @@ def __init__(self, model_runner: ModelRunner):

```diff
         self.req_to_token = model_runner.req_to_token_pool.req_to_token
         self.graph_mode = False
         self.use_fia = get_bool_env_var("ASCEND_USE_FIA", "False")
-        self.enable_torch_compile = model_runner.server_args.enable_torch_compile
         self.speculative_num_draft_tokens = (
             model_runner.server_args.speculative_num_draft_tokens
         )
```

@@ -264,6 +266,11 @@ def __init__(self, model_runner: ModelRunner):

```diff
         if self.use_mla:
             self.ringmla_mask = self.ascend_attn_mask_builder.ringmla_mask

+        self.enable_torchair_compile = model_runner.server_args.enable_torchair_compile
+        if self.enable_torchair_compile:
+            max_total_tokens = model_runner.max_total_num_tokens
+            self.max_seqlen_pad = max_total_tokens // model_runner.server_args.page_size
+
     def get_verify_buffers_to_fill_after_draft(self):
         """
         Return buffers for verify attention kernels that needs to be filled after draft.
```

@@ -283,12 +290,29 @@ def init_forward_metadata(self, forward_batch: ForwardBatch):

```diff
         seq_lens_max = forward_batch.seq_lens.max()
         if forward_batch.forward_mode.is_target_verify():
             seq_lens_max += self.speculative_num_draft_tokens
-        self.forward_metadata.block_tables = (
+        block_tables = (
             forward_batch.req_to_token_pool.req_to_token[
                 forward_batch.req_pool_indices, :seq_lens_max
             ][:, :: self.page_size]
             // self.page_size
         )
+
+        if (
+            self.enable_torchair_compile
+            and forward_batch.forward_mode.is_decode_or_idle()
+        ):
+            bs = forward_batch.input_ids.size(0)
+            device = forward_batch.input_ids.device
+            self.forward_metadata.block_tables = torch.full(
+                (bs, self.max_seqlen_pad), -1, dtype=torch.int32, device=device
+            )
+            self.forward_metadata.block_tables[:, : block_tables.size(1)].copy_(
+                block_tables
+            )
+        else:
+            self.forward_metadata.block_tables = block_tables

         if forward_batch.extend_seq_lens is not None:
             self.forward_metadata.extend_seq_lens_cpu_int = (
                 forward_batch.extend_seq_lens.cpu().int()
```

@@ -1145,6 +1169,17 @@ def forward_decode_graph(

```diff
         else:
             actual_seq_len_kv = self.forward_metadata.seq_lens_cpu_int

+        if (
+            self.enable_piecewise_npu_graph_decode
+            and torch.compiler.is_dynamo_compiling()
+        ):
+            # input args for submodule forward
+            forward_batch.req_to_token_pool.req_to_token.add_(
+                forward_batch.req_to_token_pool.req_to_token
+            )
+            forward_batch.req_pool_indices.add_(forward_batch.req_pool_indices)
+            forward_batch.seq_lens.add_(forward_batch.seq_lens)
+
```

Comment on lines +1172 to +1181:

> **Contributor:** The use of in-place `add_` operations here doubles the values of `req_to_token`, `req_pool_indices`, and `seq_lens` during tracing, mutating shared batch state.

```diff
         torch_npu._npu_paged_attention(
             query=query,
             key_cache=k_cache,
```

@@ -1256,7 +1291,7 @@ def forward_decode(

```diff
                 topk_indices,
             )

-        if self.graph_mode and (not self.enable_torch_compile):
+        if self.graph_mode and (not self.enable_torchair_compile):
             return self.forward_decode_graph(
                 q,
                 k,
```
@@ -0,0 +1,39 @@

```python
from typing import List

import torch

import sglang.srt.hardware_backend.npu.cmo
from sglang.srt.utils import direct_register_custom_op


@torch.library.custom_op("sglang::wait_cmo_stream", mutates_args=())
def wait_cmo_stream() -> None:
    if sglang.srt.hardware_backend.npu.cmo.get_cmo_stream():
        sglang.srt.hardware_backend.npu.cmo.wait_cmo_stream()


@wait_cmo_stream.register_fake
def wait_cmo_stream_fake() -> None:
    pass


def get_cmo_stream() -> bool:
    return True


def prepare_weight_cache(handle: torch.Tensor, cache: List[torch.Tensor]) -> None:
    sglang.srt.hardware_backend.npu.cmo.prepare_weight_cache(handle, cache)


def prepare_weight_cache_register_fake(
    handle: torch.Tensor, cache: List[torch.Tensor]
) -> None:
    pass


direct_register_custom_op(
    op_name="prepare_weight_cache",
    op_func=prepare_weight_cache,
    mutates_args=["handle"],
    fake_impl=prepare_weight_cache_register_fake,
)
```
@@ -0,0 +1,20 @@

```python
# Copyright 2023-2025 SGLang Team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import torch_npu


class CompilationContext:
    graph_memory_pool = None
    stream: torch_npu.npu.Stream = None
```
@@ -0,0 +1,55 @@

```python
from typing import List

import sgl_kernel_npu.norm.split_qkv_rmsnorm_rope
import torch


@torch.library.custom_op("sglang::split_qkv_rmsnorm_rope", mutates_args=())
def split_qkv_rmsnorm_rope(
    input: torch.Tensor,
    sin: torch.Tensor,
    cos: torch.Tensor,
    q_weight: torch.Tensor,
    k_weight: torch.Tensor,
    q_hidden_size: int,
    kv_hidden_size: int,
    head_dim: int,
    eps: float,
    q_bias: torch.Tensor,
    k_bias: torch.Tensor,
) -> List[torch.Tensor]:
    q, k, v = sgl_kernel_npu.norm.split_qkv_rmsnorm_rope.split_qkv_rmsnorm_rope(
        input,
        sin,
        cos,
        q_weight,
        k_weight,
        q_hidden_size,
        kv_hidden_size,
        head_dim,
        eps,
        q_bias,
        k_bias,
    )
    return [q, k, v]


@split_qkv_rmsnorm_rope.register_fake
def split_qkv_rmsnorm_rope_fake(
    input: torch.Tensor,
    sin: torch.Tensor,
    cos: torch.Tensor,
    q_weight: torch.Tensor,
    k_weight: torch.Tensor,
    q_hidden_size: int,
    kv_hidden_size: int,
    head_dim: int,
    eps: float,
    q_bias: torch.Tensor,
    k_bias: torch.Tensor,
) -> List[torch.Tensor]:
    # TODO: generalize shape
    q = torch.empty((128, 2048), dtype=input.dtype, device=input.device)
    k = torch.empty((128, 256), dtype=input.dtype, device=input.device)
    v = torch.empty((128, 256), dtype=input.dtype, device=input.device)
    return [q, k, v]
```
> **Contributor:** The hyperlinks in this document point to a personal fork (`eshoguli/sglang`). For official documentation, these links should be updated to point to the main `sgl-project/sglang` repository or use relative paths to ensure they are valid and accessible for all users after the pull request is merged.