
Conversation

@Makcum888e (Contributor) commented on Sep 25, 2025

Motivation

Using the torch_npu implementation of mrope, instead of the native torch implementation composed of many small operations, gives a significant performance increase.

Modifications

forward renamed to forward_native
forward_cpu and forward_cuda overridden to delegate to forward_native instead of falling back to the base-class implementations
forward_npu overridden with the torch_npu implementation (a sketch of the intended structure follows below)
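
Roughly, the resulting structure looks like the sketch below. This is illustrative only: the method names come from this PR description and the review comments, while the argument lists, the helper attributes, and the exact torch_npu.npu_mrope call signature are assumptions rather than code taken from the diff.

```python
import torch


class MRotaryEmbedding:
    # In SGLang this subclasses RotaryEmbedding; the base class and the
    # attributes used below (cos_sin_cache, head_size, mrope_section) are
    # assumed from context, not copied from the diff.

    def forward_native(self, positions: torch.Tensor, query: torch.Tensor,
                       key: torch.Tensor):
        # The old `forward`: composes the multimodal rotary embedding from many
        # small torch ops (per-section slicing, cos/sin gathers, concatenation).
        ...

    def forward_cpu(self, positions, query, key):
        # Keep the CPU path on the native implementation instead of inheriting
        # the base-class forward_cpu.
        return self.forward_native(positions, query, key)

    def forward_cuda(self, positions, query, key):
        # Same for CUDA: behaviour unchanged, just routed through forward_native.
        return self.forward_native(positions, query, key)

    def forward_npu(self, positions, query, key):
        # One fused Ascend kernel instead of many small ops. The exact
        # npu_mrope argument names and order here are an assumption.
        import torch_npu

        query, key = torch_npu.npu_mrope(
            positions,
            query.contiguous(),
            key.contiguous(),
            self.cos_sin_cache.contiguous(),
            self.head_size,
            mrope_section=self.mrope_section,
            rotary_mode="half",
        )
        return query, key
```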

Accuracy Tests

Tested on Qwen3-VL-4B-Instruct.

GSM8k (before → after):
Accuracy: 0.729 → 0.728
Invalid: 0.122 → 0.124
Latency: 216.923 s → 178.438 s
Output throughput: 961.137 token/s → 1155.518 token/s

MMMU (before → after):
Benchmark time: 221.995 s → 208.065 s
Overall accuracy: 0.499 → 0.496

Benchmarking and Profiling

bench_serving with 64 prompts, random input length 2048, random output length 2048 (before → after):

Mean TTFT (ms): 3667.45 → 3494.33
Mean ITL (ms): 49.65 → 43.61


@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Makcum888e, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an optimization for the MRotaryEmbedding component by integrating the torch_npu implementation of mrope. This change aims to leverage specialized hardware acceleration on Ascend NPUs, leading to improved computational efficiency and reduced latency for AI workloads. The core logic has been refactored to accommodate this NPU-specific path while maintaining existing CPU and CUDA functionalities.

Highlights

  • Performance Optimization: The primary goal of this PR is to significantly improve the performance of MRotaryEmbedding by utilizing the torch_npu implementation of mrope, which is optimized for Ascend NPUs.
  • Code Refactoring: The existing forward method has been renamed to forward_native. New methods, forward_cpu and forward_cuda, have been introduced to explicitly call forward_native, ensuring compatibility while allowing forward_npu to be used from the base class for NPU-specific execution.
  • Accuracy and Performance Benchmarks: Accuracy tests on GSM8k and MMMU show minimal changes in accuracy (0.729 -> 0.728 for GSM8k, 0.499 -> 0.496 for MMMU), while demonstrating notable performance gains. GSM8k latency improved from 216.923s to 178.438s, and output throughput increased from 961.137 token/s to 1155.518 token/s. MMMU benchmark time decreased from 221.995s to 208.065s.
  • Serving Benchmarks: Benchmarking with bench_serving (64 prompts, random input/output length 2048) shows improvements in Mean TTFT (3667.45ms -> 3494.33ms) and Mean ITL (49.65ms -> 43.61ms).

@gemini-code-assist bot left a comment

Code Review

This pull request aims to optimize MRotaryEmbedding for NPU by using torch_npu.npu_mrope. The changes involve renaming forward to forward_native and adding device-specific forward_cpu and forward_cuda methods. However, I've found some critical issues. The new forward_cpu and forward_cuda methods have incorrect signatures, which will lead to runtime errors. More importantly, the NPU implementation for MRotaryEmbedding seems to be missing. It inherits forward_npu from the base class, which is incorrect for MRotaryEmbedding and will not work as intended. I've provided suggestions to fix these issues.

@ssshinigami left a comment

LGTM


@hnyls2002 merged commit c2e56da into sgl-project:main on Nov 12, 2025
71 of 84 checks passed