Add AscendC triangular inverse using AIC and AIV#366

Open
gioelegott wants to merge 23 commits into sgl-project:main from gioelegott:6-triinv-integrate-tri_inv_cube_col_sweep-kernel

Conversation

@gioelegott

This MR contributes an AscendC triangular inverse kernel that implements a column-sweep algorithm using the vector cores and the cube core. The algorithm is based on @zouzias's approach and takes advantage of all of the NPU's resources.
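For intuition, the column sweep over a lower-triangular matrix can be sketched in plain NumPy. This is an illustrative reference only, not the AscendC kernel; `tril_inv_col_sweep` is a name chosen here for the sketch:

```python
import numpy as np

def tril_inv_col_sweep(T):
    """Invert a lower-triangular matrix one column at a time
    (forward substitution), mirroring the column-sweep idea."""
    n = T.shape[0]
    X = np.zeros_like(T, dtype=np.float64)
    for j in range(n):                # sweep over columns of the inverse
        for i in range(j, n):         # forward substitution for column j
            rhs = 1.0 if i == j else 0.0
            X[i, j] = (rhs - T[i, :i] @ X[:i, j]) / T[i, i]
    return X

rng = np.random.default_rng(0)
T = np.tril(rng.standard_normal((8, 8))) + 8.0 * np.eye(8)  # well-conditioned
residual = np.max(np.abs(tril_inv_col_sweep(T) @ T - np.eye(8)))
```

On the NPU, the per-column substitutions map to the vector cores while the batched matrix products go through the cube core.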

This is joint work with @asobczyk, @zouzias, and @learning-chip.

The users of chunk_gated_delta_rule_native can use the new kernel by setting tri_inv_fn = sgl_kernel_npu.fla.chunk.fast_cube_inv_tril.

The kernel torch.ops.npu.tri_inv_cube is tested on Ascend A2 and 910B4 (x86_64). The kernel supports only fp16 due to cube core limitations. This may impact end-to-end accuracy, but yields large performance improvements for big batch sizes.
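The fp16-only restriction matters because merely rounding the inputs to half precision already perturbs the resulting inverse. A quick NumPy illustration of the effect (not NPU-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
T = np.tril(rng.standard_normal((n, n)).astype(np.float32)) + n * np.eye(n, dtype=np.float32)

inv_fp32 = np.linalg.inv(T)
# Round the input through fp16 first, then invert in fp32.
inv_fp16_in = np.linalg.inv(T.astype(np.float16).astype(np.float32))

# The inverse computed from fp16-rounded input differs measurably from fp32.
gap = np.max(np.abs(inv_fp16_in - inv_fp32))
```

Whether this gap is acceptable end-to-end depends on how the inverse is consumed downstream, hence the relaxed tolerances in the tests.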

Performance

| batch_size | chunk_size | triton_time_us | ascendc_aiv_time_us | ascendc_aic_time_us | speed_up (aic vs triton) |
|---|---|---|---|---|---|
| 256 | 16 | 386 | 151 | 346 | 1.115606936 |
| 384 | 16 | 292 | 196 | 469 | 0.622601279 |
| 512 | 16 | 384 | 239 | 590 | 0.650847458 |
| 640 | 16 | 431 | 285 | 703 | 0.613086771 |
| 768 | 16 | 524 | 335 | 827 | 0.633615478 |
| 896 | 16 | 614 | 381 | 942 | 0.651804671 |
| 1024 | 16 | 663 | 427 | 1064 | 0.623120301 |
| 1152 | 16 | 754 | 469 | 1189 | 0.634146341 |
| 128 | 32 | 423 | 266 | 346 | 1.222543353 |
| 192 | 32 | 576 | 323 | 469 | 1.228144989 |
| 256 | 32 | 731 | 427 | 590 | 1.238983051 |
| 320 | 32 | 884 | 484 | 703 | 1.257467994 |
| 384 | 32 | 1082 | 592 | 827 | 1.30834341 |
| 448 | 32 | 1240 | 701 | 942 | 1.316348195 |
| 512 | 32 | 1394 | 764 | 1064 | 1.310150376 |
| 576 | 32 | 1545 | 870 | 1189 | 1.29941127 |
| 64 | 64 | 835 | 487 | 367 | 2.27520436 |
| 96 | 64 | 1233 | 690 | 439 | 2.808656036 |
| 128 | 64 | 1592 | 903 | 580 | 2.744827586 |
| 160 | 64 | 1943 | 908 | 669 | 2.904334828 |
| 192 | 64 | 2338 | 1121 | 785 | 2.978343949 |
| 224 | 64 | 2698 | 1333 | 905 | 2.98121547 |
| 256 | 64 | 3088 | 1545 | 1009 | 3.060455897 |
| 288 | 64 | 3449 | 1761 | 1126 | 3.063055062 |
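Assuming the fifth numeric column is the AIC kernel time, the final column is simply triton_time_us divided by the AIC time. A spot-check on two rows:

```python
# Spot-check of the speed-up column: triton_time_us / aic_time_us.
# (batch_size, chunk_size, triton_us, aiv_us, aic_us) for two table rows.
rows = [(256, 16, 386, 151, 346), (256, 64, 3088, 1545, 1009)]
speedups = [triton / aic for (_, _, triton, _, aic) in rows]
```

The reproduced ratios match the reported 1.115606936 and 3.060455897, supporting that reading of the columns.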

@gemini-code-assist

Summary of Changes

Hello @gioelegott, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the sgl_kernel_npu library by integrating a new, highly optimized AscendC kernel for triangular matrix inversion. This kernel, tri_inv_cube_col_sweep, is designed to exploit the unique architecture of Ascend NPUs, specifically utilizing both Vector Cores and the Cube Core to accelerate the column sweep algorithm. While currently restricted to fp16 precision, it delivers substantial performance improvements for large-scale operations, providing a faster alternative for applications requiring efficient triangular matrix inversion. The changes include the kernel implementation, host-side integration, Python bindings, and extensive testing to validate its functionality and performance.

Highlights

  • New AscendC Triangular Inverse Kernel: Introduced tri_inv_cube_col_sweep, an AscendC kernel for triangular matrix inversion, leveraging both Vector Cores (AIV) and Cube Core (AIC) on Ascend NPUs.
  • Column Sweep Algorithm: The kernel implements a column sweep algorithm, optimized to take full advantage of NPU resources.
  • FP16 Support: The new kernel currently supports only fp16 data type due to Cube Core limitations, which might impact end-to-end accuracy but offers significant performance gains.
  • Performance Improvements: Benchmarks show considerable speed-ups for large batch sizes compared to existing implementations.
  • Python API Integration: A new Python function sgl_kernel_npu.fla.chunk.fast_cube_inv_tril is added, allowing users to easily utilize the new kernel.
  • Comprehensive Testing: New and updated test cases ensure the correctness and performance of the cube_triangular_inverse operation across various configurations.


Changelog
  • csrc/CMakeLists.txt
    • Added build configurations for the new tri_inv_cube.cpp host-side code and tri_inv_cube_kernel.cpp kernel.
  • csrc/pytorch_extensions.cpp
    • Registered cube_triangular_inverse as a new PyTorch operator, mapping it to the sglang::npu_kernel::tri_inv_cube_col_sweep implementation.
  • csrc/tri_inv/README.md
    • Updated the description to differentiate between AIV-based and AIC-based triangular inversion kernels and their supported data types and matrix sizes.
  • csrc/tri_inv/op_host/tiling_tri_inv_cube.h
    • Introduced a new header defining the TriInvColumnSweepCubeTiling structure for kernel configuration.
  • csrc/tri_inv/op_host/tri_inv_cube.cpp
    • Implemented the host-side logic for the tri_inv_cube_col_sweep kernel, including tiling parameter calculation and kernel launch, with fp16 and square matrix validation.
  • csrc/tri_inv/op_kernel/kernel_mat_gen.h
    • Added the KernelMatGen class, an AIV-side component responsible for generating matrices for the column sweep algorithm and managing synchronization.
  • csrc/tri_inv/op_kernel/kernel_tri_inv_cube.h
    • Defined the KernelTriInvCubeColSweep class, the AIC-side component performing matrix multiplications and data transfers, orchestrating the triangular inverse computation.
  • csrc/tri_inv/op_kernel/tri_inv_cube_kernel.cpp
    • Provided the entry point for the tri_inv_cube_col_sweep_fp16 kernel, linking host-side parameters to the kernel execution.
  • include/sgl_kenel_npu_ops.h
    • Declared the new tri_inv_cube_col_sweep function and clarified the existing tri_inv_col_sweep as AIV-based.
  • python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py
    • Exposed the new kernel through fast_cube_inv_tril, a Python wrapper that handles input preparation for the cube_triangular_inverse operator.
  • tests/python/sgl_kernel_npu/test_gated_delta_ascendc_tri_inv.py
    • Updated tests to include fast_cube_inv_tril in parametrization and adjusted assertion tolerances to accommodate fp16 precision.
  • tests/python/sgl_kernel_npu/test_triangular_inverse.py
    • Added NPU device setup, a NumPy reference implementation (np_tril_inv_cube_cs), and new test cases for the cube_triangular_inverse kernel, along with modifications to existing tests for consistency.
Activity
  • A new AscendC kernel tri_inv_cube_col_sweep has been implemented for triangular matrix inversion.
  • The kernel leverages both Vector Cores (AIV) and Cube Core (AIC) for optimized performance.
  • Performance benchmarks are included in the PR description, demonstrating significant speed-ups for large batch sizes.
  • Python bindings and a new API function fast_cube_inv_tril have been added for easy integration.
  • Extensive unit tests have been added and updated to validate the new kernel's correctness and performance characteristics, including fp16 precision considerations.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new AscendC kernel for triangular matrix inversion using cube cores (AIC) and vector cores (AIV), aiming for better performance on large batches. The changes span the C++ kernel implementation, host-side logic, PyTorch op registration, and the associated Python wrappers and tests. While the performance gains are notable, my review has uncovered several critical correctness issues:

  • There appears to be a bug in the core inversion algorithm in kernel_mat_gen.h, where an iteration is missed.
  • The Python wrapper fast_cube_inv_tril seems to pass a matrix with the wrong memory layout (lower triangular instead of upper triangular) to the new kernel.
  • In test_triangular_inverse.py, a function redefinition causes one of the main tests to be silently skipped.

I've also noted several medium-severity issues, including very high test tolerances that could mask bugs, and areas for code cleanup to improve maintainability. The critical issues should be addressed to ensure the correctness of this new high-performance kernel.
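On the layout concern: for any invertible matrix, inv(A).T equals inv(A.T), so a kernel that expects one triangular layout but is handed the other returns the transpose of the intended result rather than failing loudly. A small NumPy illustration (not the wrapper code itself):

```python
import numpy as np

rng = np.random.default_rng(2)
L = np.tril(rng.standard_normal((6, 6))) + 6.0 * np.eye(6)  # lower triangular

# inv(L) and inv(L.T) are transposes of each other, so feeding a
# lower-triangular buffer to a kernel expecting upper-triangular data
# silently yields the transpose of the desired inverse.
gap = np.max(np.abs(np.linalg.inv(L).T - np.linalg.inv(L.T)))
```

Because the result is still a valid triangular matrix, loose test tolerances may not catch such a mix-up; an asymmetric test matrix exercises it directly.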


@zouzias zouzias left a comment


Left a few minor comments.
