[Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels#17889
BBuf merged 7 commits into sgl-project:main
Conversation
Summary of Changes

Hello @celve, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates performance-critical JIT-compiled CUDA kernels into the SGLang framework to accelerate the tensor concatenation operations fundamental to Multi-head Latent Attention (MLA). By offloading these computations to optimized GPU kernels, the change aims to significantly improve the efficiency of MLA-related processes. The introduction is accompanied by robust testing to ensure functional correctness and compatibility with existing components.
Code Review
This pull request introduces JIT-compiled CUDA kernels for MLA tensor concatenation, concat_mla_k and concat_mla_absorb_q. The changes include the Python interface, the CUDA C++ kernel implementation, and comprehensive unit tests. The code is well-structured and the tests are thorough, verifying correctness against PyTorch and consistency with AOT versions. My feedback focuses on performance optimizations within the concat_mla_absorb_q_kernel by hoisting loop-invariant computations and suggesting the use of non-temporal memory operations for better performance, consistent with the other kernel in the file.
constexpr int B_LAST_DIM = 64;
constexpr int OUT_LAST_DIM = A_LAST_DIM + B_LAST_DIM;

__global__ void concat_mla_absorb_q_kernel(
The memory access pattern in this kernel is streaming, similar to concat_mla_k_kernel. For better performance and consistency, consider using non-temporal load/store intrinsics (e.g., ld_na_global_..., st_na_global_...) for a, b, and out tensors. This can reduce L1 cache pollution. Using these would likely require adding v4 variants for int4 types to the memory utilities, or composing them from existing v1 and v2 variants.
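For illustration, here is a minimal self-contained sketch of the streaming-access idea using the standard CUDA cache-hint intrinsics __ldcs/__stcs; the exact signatures of the project's ld_na_global_*/st_na_global_* helpers are not shown in this diff, so this sketch only demonstrates the concept, not the suggested change itself.

// Illustration only: uses the standard CUDA cache-hint intrinsics
// __ldcs (load, evict-first) and __stcs (store, streaming) on int4,
// not the project's ld_na_global_* / st_na_global_* helpers.
__global__ void streaming_copy_int4(const int4* __restrict__ src,
                                    int4* __restrict__ dst,
                                    long long num_vec4) {
  long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  if (i < num_vec4) {
    int4 v = __ldcs(src + i);  // streaming load: data is read only once
    __stcs(dst + i, v);        // streaming store: avoid displacing hot cache lines
  }
}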
#pragma unroll
for (int i = 0; i < A_NUM_UNROLL; ++i) {
  const ABufType* base_addr = reinterpret_cast<ABufType*>(a + idx_0 * a_stride_0 + idx_1 * a_stride_1);
  a_buf[i] = *(base_addr + i * 32 + lane_id);
}
For performance, the base address calculation, which is loop-invariant, should be hoisted out of the loop. This avoids redundant calculations in each iteration.
const ABufType* base_addr = reinterpret_cast<ABufType*>(a + idx_0 * a_stride_0 + idx_1 * a_stride_1);
#pragma unroll
for (int i = 0; i < A_NUM_UNROLL; ++i) {
  a_buf[i] = *(base_addr + i * 32 + lane_id);
}
#pragma unroll
for (int i = 0; i < A_NUM_UNROLL; ++i) {
  ABufType* base_addr = reinterpret_cast<ABufType*>(out + idx_0 * out_stride_0 + idx_1 * out_stride_1);
  *(base_addr + i * 32 + lane_id) = a_buf[i];
}
Similarly, the base address calculation for the output tensor should be hoisted out of the loop to avoid redundant computations.
ABufType* base_addr = reinterpret_cast<ABufType*>(out + idx_0 * out_stride_0 + idx_1 * out_stride_1);
#pragma unroll
for (int i = 0; i < A_NUM_UNROLL; ++i) {
  *(base_addr + i * 32 + lane_id) = a_buf[i];
}
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97084edb58
TensorMatcher({N0_out, N1_out, D_out})
    .with_strides({S0_out, S1_out, 1})
    .with_dtype<bf16_t>()
    .with_device<kDLCUDA>(device)
    .verify(out);
Require output dims to match inputs in concat_mla_absorb_q
The output tensor’s leading dimensions are verified with independent symbols (N0_out, N1_out) but never required to match a/b. The kernel computes num_items and indexing from a’s sizes and then writes into out, so a caller that passes a preallocated out with smaller or different first dimensions will cause out-of-bounds writes or corrupted results. Add a RuntimeCheck tying N0_out/N1_out to N0_a/N1_a (or derive indexing from out) to prevent this.
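As a sketch of the suggested fix, assuming TensorMatcher symbols unify across verify() calls in the way the surrounding validation code implies (the exact symbol semantics are not visible in this diff), the output could simply be verified against the input's leading-dimension symbols:

// Sketch only: reuse a's leading-dimension symbols (N0_a, N1_a) so that a
// preallocated `out` with mismatched leading dimensions fails verification
// instead of causing out-of-bounds writes. TensorMatcher symbol behavior is
// assumed from the surrounding code, not confirmed.
TensorMatcher({N0_a, N1_a, D_out})
    .with_strides({S0_out, S1_out, 1})
    .with_dtype<bf16_t>()
    .with_device<kDLCUDA>(device)
    .verify(out);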
Pull request overview
This pull request adds JIT-compiled CUDA kernels for Multi-head Latent Attention (MLA) tensor concatenation operations, supporting models like DeepSeek-V2/V3/R1.
Changes:
- Added Python interface for two JIT kernels: concat_mla_k and concat_mla_absorb_q
- Implemented optimized CUDA kernels with warp-level parallelism and memory access optimizations
- Added comprehensive unit tests verifying correctness against PyTorch reference and AOT kernel implementations
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| python/sglang/jit_kernel/concat_mla.py | Python interface providing module loading, error handling, and public API for the JIT kernels |
| python/sglang/jit_kernel/csrc/elementwise/concat_mla.cuh | CUDA kernel implementations with memory utilities, tensor validation, and optimized memory access patterns |
| python/sglang/jit_kernel/tests/test_concat_mla.py | Unit tests comparing JIT kernels against PyTorch reference and AOT implementations with various input sizes |
We should apply this kernel to DeepSeek-V3/R1.

@DarkSharpness Any other advice?

Do you have any benchmark results, @celve?

Benchmark results on H100:

@celve fix lint

fixed
/tag-and-rerun-ci |
Motivation
Add JIT-compiled CUDA kernels for MLA tensor concatenation: concat_mla_k and concat_mla_absorb_q, used by MLA models such as DeepSeek-V2/V3/R1.
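For context, both kernels perform a concatenation of two tensors along the last dimension (e.g. joining the no-RoPE and RoPE parts of the MLA query/key). The naive, contiguous-only reference below is shown purely to illustrate the semantics; the kernel name and launch shape are illustrative, and the PR's actual kernels additionally handle explicit strides, vectorized loads, and warp-level parallelism.

// Naive reference for out[r, :] = concat(a[r, :], b[r, :]) along the last dim.
// Assumes contiguous tensors; not the optimized kernel added in this PR.
template <typename T>
__global__ void concat_last_dim_ref(const T* a, const T* b, T* out,
                                    long long rows, int Da, int Db) {
  long long row = blockIdx.x;              // one block per flattened row
  if (row >= rows) return;
  const int Dout = Da + Db;
  for (int d = threadIdx.x; d < Dout; d += blockDim.x) {
    out[row * Dout + d] = (d < Da) ? a[row * Da + d] : b[row * Db + (d - Da)];
  }
}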
Modifications
- python/sglang/jit_kernel/concat_mla.py: Python interface, module loading, and public API for the JIT kernels
- python/sglang/jit_kernel/csrc/elementwise/concat_mla.cuh: CUDA kernel implementations with tensor validation and optimized memory access
- python/sglang/jit_kernel/tests/test_concat_mla.py: unit tests comparing the JIT kernels against the PyTorch reference and AOT implementations
Accuracy Tests
Verified against PyTorch implementation and AOT sgl_kernel.
Benchmarking and Profiling
Benchmark results on H100 are posted in the conversation above.
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci