Add AscendC triangular inverse using AIC and AIV#366

Open
gioelegott wants to merge 23 commits into sgl-project:main from gioelegott:6-triinv-integrate-tri_inv_cube_col_sweep-kernel

Conversation

@gioelegott

This MR contributes an AscendC triangular inverse kernel that implements a column-sweep algorithm using the vector cores and the cube core. The algorithm is based on @zouzias's approach and takes advantage of all of the NPU's resources.
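For intuition, the column sweep over a lower-triangular matrix can be sketched in plain NumPy. This is an illustrative reference only, not the AscendC kernel; `tril_inv_col_sweep` is a name chosen here for the sketch:

```python
import numpy as np

def tril_inv_col_sweep(T):
    """Invert a lower-triangular matrix one column at a time
    (forward substitution), mirroring the column-sweep idea."""
    n = T.shape[0]
    X = np.zeros_like(T, dtype=np.float64)
    for j in range(n):                # sweep over columns of the inverse
        for i in range(j, n):         # forward substitution for column j
            rhs = 1.0 if i == j else 0.0
            X[i, j] = (rhs - T[i, :i] @ X[:i, j]) / T[i, i]
    return X

rng = np.random.default_rng(0)
T = np.tril(rng.standard_normal((8, 8))) + 8.0 * np.eye(8)  # well-conditioned
residual = np.max(np.abs(tril_inv_col_sweep(T) @ T - np.eye(8)))
```

On the NPU, the per-column substitutions map to the vector cores while the batched matrix products go through the cube core.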

This is joint work with @asobczyk, @zouzias, and @learning-chip.

The users of chunk_gated_delta_rule_native can use the new kernel by setting tri_inv_fn = sgl_kernel_npu.fla.chunk.fast_cube_inv_tril.

The kernel torch.ops.npu.tri_inv_cube is tested on Ascend A2 and 910B4 (x86_64). The kernel supports only fp16 due to cube core limitations. This may impact end-to-end accuracy, but yields large performance improvements for big batch sizes.
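The fp16-only restriction matters because merely rounding the inputs to half precision already perturbs the resulting inverse. A quick NumPy illustration of the effect (not NPU-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
T = np.tril(rng.standard_normal((n, n)).astype(np.float32)) + n * np.eye(n, dtype=np.float32)

inv_fp32 = np.linalg.inv(T)
# Round the input through fp16 first, then invert in fp32.
inv_fp16_in = np.linalg.inv(T.astype(np.float16).astype(np.float32))

# The inverse computed from fp16-rounded input differs measurably from fp32.
gap = np.max(np.abs(inv_fp16_in - inv_fp32))
```

Whether this gap is acceptable end-to-end depends on how the inverse is consumed downstream, hence the relaxed tolerances in the tests.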

Performance

| batch_size | chunk_size | triton_time_us | ascendc_aiv_time_us | ascendc_aic_time_us | speed_up (aic vs triton) |
|---|---|---|---|---|---|
| 256 | 16 | 386 | 151 | 346 | 1.115606936 |
| 384 | 16 | 292 | 196 | 469 | 0.622601279 |
| 512 | 16 | 384 | 239 | 590 | 0.650847458 |
| 640 | 16 | 431 | 285 | 703 | 0.613086771 |
| 768 | 16 | 524 | 335 | 827 | 0.633615478 |
| 896 | 16 | 614 | 381 | 942 | 0.651804671 |
| 1024 | 16 | 663 | 427 | 1064 | 0.623120301 |
| 1152 | 16 | 754 | 469 | 1189 | 0.634146341 |
| 128 | 32 | 423 | 266 | 346 | 1.222543353 |
| 192 | 32 | 576 | 323 | 469 | 1.228144989 |
| 256 | 32 | 731 | 427 | 590 | 1.238983051 |
| 320 | 32 | 884 | 484 | 703 | 1.257467994 |
| 384 | 32 | 1082 | 592 | 827 | 1.30834341 |
| 448 | 32 | 1240 | 701 | 942 | 1.316348195 |
| 512 | 32 | 1394 | 764 | 1064 | 1.310150376 |
| 576 | 32 | 1545 | 870 | 1189 | 1.29941127 |
| 64 | 64 | 835 | 487 | 367 | 2.27520436 |
| 96 | 64 | 1233 | 690 | 439 | 2.808656036 |
| 128 | 64 | 1592 | 903 | 580 | 2.744827586 |
| 160 | 64 | 1943 | 908 | 669 | 2.904334828 |
| 192 | 64 | 2338 | 1121 | 785 | 2.978343949 |
| 224 | 64 | 2698 | 1333 | 905 | 2.98121547 |
| 256 | 64 | 3088 | 1545 | 1009 | 3.060455897 |
| 288 | 64 | 3449 | 1761 | 1126 | 3.063055062 |
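Assuming the fifth numeric column is the AIC kernel time, the final column is simply triton_time_us divided by the AIC time. A spot-check on two rows:

```python
# Spot-check of the speed-up column: triton_time_us / aic_time_us.
# (batch_size, chunk_size, triton_us, aiv_us, aic_us) for two table rows.
rows = [(256, 16, 386, 151, 346), (256, 64, 3088, 1545, 1009)]
speedups = [triton / aic for (_, _, triton, _, aic) in rows]
```

The reproduced ratios match the reported 1.115606936 and 3.060455897, supporting that reading of the columns.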

@gemini-code-assist

Summary of Changes

Hello @gioelegott, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the sgl_kernel_npu library by integrating a new, highly optimized AscendC kernel for triangular matrix inversion. This kernel, tri_inv_cube_col_sweep, is designed to exploit the unique architecture of Ascend NPUs, specifically utilizing both Vector Cores and the Cube Core to accelerate the column sweep algorithm. While currently restricted to fp16 precision, it delivers substantial performance improvements for large-scale operations, providing a faster alternative for applications requiring efficient triangular matrix inversion. The changes include the kernel implementation, host-side integration, Python bindings, and extensive testing to validate its functionality and performance.

Highlights

  • New AscendC Triangular Inverse Kernel: Introduced tri_inv_cube_col_sweep, an AscendC kernel for triangular matrix inversion, leveraging both Vector Cores (AIV) and Cube Core (AIC) on Ascend NPUs.
  • Column Sweep Algorithm: The kernel implements a column sweep algorithm, optimized to take full advantage of NPU resources.
  • FP16 Support: The new kernel currently supports only fp16 data type due to Cube Core limitations, which might impact end-to-end accuracy but offers significant performance gains.
  • Performance Improvements: Benchmarks show considerable speed-ups for large batch sizes compared to existing implementations.
  • Python API Integration: A new Python function sgl_kernel_npu.fla.chunk.fast_cube_inv_tril is added, allowing users to easily utilize the new kernel.
  • Comprehensive Testing: New and updated test cases ensure the correctness and performance of the cube_triangular_inverse operation across various configurations.


Changelog
  • csrc/CMakeLists.txt
    • Added build configurations for the new tri_inv_cube.cpp host-side code and tri_inv_cube_kernel.cpp kernel.
  • csrc/pytorch_extensions.cpp
    • Registered cube_triangular_inverse as a new PyTorch operator, mapping it to the sglang::npu_kernel::tri_inv_cube_col_sweep implementation.
  • csrc/tri_inv/README.md
    • Updated the description to differentiate between AIV-based and AIC-based triangular inversion kernels and their supported data types and matrix sizes.
  • csrc/tri_inv/op_host/tiling_tri_inv_cube.h
    • Introduced a new header defining the TriInvColumnSweepCubeTiling structure for kernel configuration.
  • csrc/tri_inv/op_host/tri_inv_cube.cpp
    • Implemented the host-side logic for the tri_inv_cube_col_sweep kernel, including tiling parameter calculation and kernel launch, with fp16 and square matrix validation.
  • csrc/tri_inv/op_kernel/kernel_mat_gen.h
    • Added the KernelMatGen class, an AIV-side component responsible for generating matrices for the column sweep algorithm and managing synchronization.
  • csrc/tri_inv/op_kernel/kernel_tri_inv_cube.h
    • Defined the KernelTriInvCubeColSweep class, the AIC-side component performing matrix multiplications and data transfers, orchestrating the triangular inverse computation.
  • csrc/tri_inv/op_kernel/tri_inv_cube_kernel.cpp
    • Provided the entry point for the tri_inv_cube_col_sweep_fp16 kernel, linking host-side parameters to the kernel execution.
  • include/sgl_kenel_npu_ops.h
    • Declared the new tri_inv_cube_col_sweep function and clarified the existing tri_inv_col_sweep as AIV-based.
  • python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py
    • Exposed the new kernel through fast_cube_inv_tril, a Python wrapper that handles input preparation for the cube_triangular_inverse operator.
  • tests/python/sgl_kernel_npu/test_gated_delta_ascendc_tri_inv.py
    • Updated tests to include fast_cube_inv_tril in parametrization and adjusted assertion tolerances to accommodate fp16 precision.
  • tests/python/sgl_kernel_npu/test_triangular_inverse.py
    • Added NPU device setup, a NumPy reference implementation (np_tril_inv_cube_cs), and new test cases for the cube_triangular_inverse kernel, along with modifications to existing tests for consistency.
Activity
  • A new AscendC kernel tri_inv_cube_col_sweep has been implemented for triangular matrix inversion.
  • The kernel leverages both Vector Cores (AIV) and Cube Core (AIC) for optimized performance.
  • Performance benchmarks are included in the PR description, demonstrating significant speed-ups for large batch sizes.
  • Python bindings and a new API function fast_cube_inv_tril have been added for easy integration.
  • Extensive unit tests have been added and updated to validate the new kernel's correctness and performance characteristics, including fp16 precision considerations.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new AscendC kernel for triangular matrix inversion using cube cores (AIC) and vector cores (AIV), aiming for better performance on large batches. The changes span the C++ kernel implementation, host-side logic, PyTorch op registration, and the associated Python wrappers and tests. While the performance gains are notable, my review has uncovered several critical correctness issues:

  • There appears to be a bug in the core inversion algorithm in kernel_mat_gen.h, where an iteration is missed.
  • The Python wrapper fast_cube_inv_tril seems to pass a matrix with the wrong memory layout (lower triangular instead of upper triangular) to the new kernel.
  • In test_triangular_inverse.py, a function redefinition causes one of the main tests to be silently skipped.

I've also noted several medium-severity issues, including very high test tolerances that could mask bugs, and areas for code cleanup to improve maintainability. The critical issues should be addressed to ensure the correctness of this new high-performance kernel.
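On the layout concern: for any invertible matrix, inv(A).T equals inv(A.T), so a kernel that expects one triangular layout but is handed the other returns the transpose of the intended result rather than failing loudly. A small NumPy illustration (not the wrapper code itself):

```python
import numpy as np

rng = np.random.default_rng(2)
L = np.tril(rng.standard_normal((6, 6))) + 6.0 * np.eye(6)  # lower triangular

# inv(L) and inv(L.T) are transposes of each other, so feeding a
# lower-triangular buffer to a kernel expecting upper-triangular data
# silently yields the transpose of the desired inverse.
gap = np.max(np.abs(np.linalg.inv(L).T - np.linalg.inv(L.T)))
```

Because the result is still a valid triangular matrix, loose test tolerances may not catch such a mix-up; an asymmetric test matrix exercises it directly.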


@zouzias zouzias left a comment


Left a few minor comments.
