
Conversation

b8zhong (Collaborator) commented Nov 21, 2025

In theory, test/srt/test_deepseek_v3_fp4_4gpu.py should pass.

gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a targeted fix within the TensorRT-LLM (TRT-LLM) Multi-head Latent Attention (MLA) backend by eliminating a specific Triton kernel caching mechanism. The removal of the cached_triton_kernel decorator and its associated import is intended to correct an underlying issue that was causing a DeepSeek v3 FP4 4-GPU test to fail, thereby improving the stability and correctness of the TRT-LLM integration.

Highlights

  • Triton Kernel Caching Removal: The cached_triton_kernel import and its application as a decorator to two Triton kernels, pad_draft_extend_query_kernel and unpad_draft_extend_output_kernel, have been removed.
  • TRT-LLM MLA Speculative Decoding Fix: This change addresses an issue in the TensorRT-LLM Multi-head Latent Attention (MLA) speculative-decoding path, resolving the failure that prevented the DeepSeek v3 FP4 4-GPU test from passing. A sketch of the shape of the change follows this list.
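
For orientation, here is a hedged sketch of what the change looks like. The real kernels live in SGLang's TRT-LLM MLA backend; the signature and body below are illustrative placeholders, not the actual implementation, and the commented-out decorator usage is only indicative:

```python
import triton
import triton.language as tl

# Before this PR, each kernel carried a custom caching wrapper on top of
# Triton's JIT, roughly of the form:
#
#   @cached_triton_kernel(lambda _, kwargs: kwargs["BLOCK_SIZE"])
#   @triton.jit
#   def pad_draft_extend_query_kernel(...): ...
#
# After this PR, only Triton's standard JIT cache is used:
@triton.jit
def pad_draft_extend_query_kernel(  # placeholder body, not the real kernel
    src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr
):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    vals = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, vals, mask=mask)
```

Per the highlights above, the diff is essentially just the removal of the extra decorator and its import; the kernel bodies themselves are untouched.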

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a bug in the TRT-LLM MLA speculative decoding path by removing the @cached_triton_kernel decorator from pad_draft_extend_query_kernel and unpad_draft_extend_output_kernel. This custom caching layer appears to have been causing issues, and its removal is a clean and safe fix, falling back to Triton's standard JIT caching mechanism. Since the cache key was based on a constant BLOCK_SIZE, there should be no performance regression. The code is now simpler and more robust. The changes look good.
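
To make that failure mode concrete, here is a hedged, simplified illustration in plain Python — a stand-in, not SGLang's actual cached_triton_kernel. If a memoizing wrapper keys compiled kernels on fewer parameters than actually affect specialization, the first compilation is silently reused for later, incompatible calls. Whether this is precisely what broke the spec-decoding path isn't spelled out in the thread, but it is the classic hazard of such wrappers:

```python
from typing import Any, Callable, Dict

def cached_kernel(key_fn: Callable[[Dict[str, Any]], Any]):
    """Simplified stand-in for a cached_triton_kernel-style wrapper:
    memoize the 'compiled' kernel under a user-supplied cache key."""
    def decorator(compile_fn):
        cache: Dict[Any, Any] = {}
        def launch(**kwargs):
            key = key_fn(kwargs)
            if key not in cache:                   # compile once per key...
                cache[key] = compile_fn(**kwargs)
            return cache[key]                      # ...and reuse thereafter
        return launch
    return decorator

# Key only on BLOCK_SIZE, mirroring the removed decorator's cache key.
@cached_kernel(lambda kwargs: kwargs["BLOCK_SIZE"])
def compile_pad_kernel(BLOCK_SIZE: int, head_dim: int) -> str:
    # Pretend compilation specializes on *both* constants.
    return f"kernel<BLOCK_SIZE={BLOCK_SIZE}, head_dim={head_dim}>"

print(compile_pad_kernel(BLOCK_SIZE=128, head_dim=576))  # fresh compile
print(compile_pad_kernel(BLOCK_SIZE=128, head_dim=512))  # stale: head_dim=576 reused!
```

With the wrapper gone, dispatch falls back to Triton's own JIT cache, which keys on the full specialization (constexpr values, argument types, and so on) — consistent with the review's expectation of no performance regression.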

b8zhong added the run-ci label on Nov 21, 2025
github-actions bot added the blackwell SM100/SM120 label on Nov 21, 2025
Qiaolin-Yu (Collaborator) left a comment

Why delete this? This cache is important and was added a very long time ago. Or perhaps the previous usage of this cache here was incorrect? Could you please explain more? 👀

I just took a close look at the code, and I think you're right.

hnyls2002 (Collaborator) commented

@Qiaolin-Yu @b8zhong Please track the CI status. After test/srt/test_deepseek_v3_fp4_4gpu.py passes, add the ready-to-merge label.

b8zhong force-pushed the brayden/fix-trtllm-mla-spec branch from e67b549 to cd7103e on November 30, 2025 05:47
b8zhong (Collaborator, Author) commented Dec 2, 2025

/tag-and-rerun-ci

Fridge003 merged commit 236a7c2 into sgl-project:main on Dec 2, 2025 (173 of 182 checks passed)
b8zhong deleted the brayden/fix-trtllm-mla-spec branch on December 2, 2025 06:26
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025