Fuse writing KV buffer into rope kernel (part 2: srt) #9014

zhyncs merged 116 commits into sgl-project:main from JeremieMelo:jiaqi/kv_rope_fuse
Conversation
Summary of Changes
Hello @JeremieMelo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a significant performance optimization by fusing the Key-Value (KV) cache buffer saving operation directly into the Rotary Positional Embedding (RoPE) computation. This change is specifically applied to the trtllm_mha attention mechanism. The primary goal is to reduce overhead and improve efficiency by performing these two related operations within a single, optimized kernel, rather than as separate steps. This involves modifications across the Python and C++ layers of the attention and rotary embedding components, ensuring seamless integration and leveraging CUDA stream synchronization for better performance, especially in CUDA graph capture mode.
Highlights
- **KV Buffer and RoPE Fusion:** Fused the `set_kv_buffer` operation, which saves key and value tensors to the KV cache, directly into the rotary positional embedding (RoPE) computation. This optimization specifically targets the `trtllm_mha` attention mechanism.
- **Streamlined KV Cache Updates:** The previous explicit calls to `set_kv_buffer` in the `trtllm_mha_backend` have been removed. The KV cache update now occurs implicitly as part of the `RotaryEmbedding`'s `forward_cuda` method, which calls a modified C++ kernel.
- **C++ Kernel Enhancements for Fused Operations:** The underlying C++ CUDA kernel for applying rotary embeddings (`apply_rope_pos_ids_cos_sin_cache`) has been enhanced to directly handle saving the processed K and V tensors to the KV cache buffers. This includes data type conversions, scaling, and optimized memory copies, particularly in CUDA graph capture mode, where an alternate stream is used for overlap.
- **Integration into Model Architecture:** The `GPTAttention` and `GPTLMHeadModel` classes have been updated to pass the necessary context (the attention layer, forward batch information, and the `start_layer` for memory pooling) to the rotary embedding function, enabling the fused operation.
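Conceptually, the change collapses two steps (RoPE, then a separate cache write) into one fused call. The sketch below illustrates this with hedged Python stand-ins: lists play the role of tensors, the "rotation" is a visible placeholder rather than real RoPE math, and all names are illustrative, not the actual sglang APIs.

```python
# Hedged stand-ins for the real sglang pieces; lists play the role of tensors.
def rotary_emb(q, k):
    # Placeholder rotation: negate k so the effect is observable (not real RoPE).
    return q, [-x for x in k]

def set_kv_buffer(k_buffer, v_buffer, loc, k, v):
    # Non-fused cache write: k_buffer[loc] = k, v_buffer[loc] = v.
    for i, slot in enumerate(loc):
        k_buffer[slot], v_buffer[slot] = k[i], v[i]

def fused_rope_and_save(q, k, v, k_buffer, v_buffer, loc):
    # After this PR, a single kernel performs both steps for trtllm_mha;
    # here the two Python calls stand in for that fused kernel.
    q, k = rotary_emb(q, k)
    set_kv_buffer(k_buffer, v_buffer, loc, k, v)
    return q, k
```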
Code Review
This pull request fuses the set_kv_buffer operation into the RoPE kernel for trtllm_mha attention, which is a good optimization. My review focuses on the correctness of the implementation. I've found a critical issue in the CUDA stream synchronization logic that could lead to race conditions or runtime errors, and a bug in how scaling factors are passed. I've also suggested some code cleanup by removing commented-out code blocks.
sgl-kernel/csrc/elementwise/rope.cu
Outdated
```cpp
if (is_capture_mode && alt_stream_ptr != 0) {
  cudaStream_t alt_stream = reinterpret_cast<cudaStream_t>(alt_stream_ptr);
  cudaStream_t main_stream = stream;

  // Wait for main stream to complete RoPE
  cudaStreamWaitEvent(alt_stream, nullptr, 0);

  // Copy K on main stream
  k_buffer_ptr.copy_(k_rope, /*non_blocking=*/true);

  // Copy V on alternate stream
  at::cuda::CUDAStreamGuard guard(at::cuda::getStreamFromExternal(alt_stream, device.index()));
  v_buffer_ptr.copy_(v, /*non_blocking=*/true);

  // Main stream waits for alt stream
  cudaStreamWaitEvent(main_stream, nullptr, 0);
} else {
```
The stream synchronization logic here is incorrect and will lead to runtime errors or race conditions. cudaStreamWaitEvent is called with a nullptr event, which is invalid. You need to create and use a cudaEvent_t to properly synchronize between the main and alternate streams. The original Python code stream.wait_stream(other_stream) uses an event pool internally, which needs to be replicated here.
Here is a suggestion for a correct implementation:
```cpp
if (is_capture_mode && alt_stream_ptr != 0) {
  cudaStream_t alt_stream = reinterpret_cast<cudaStream_t>(alt_stream_ptr);
  cudaStream_t main_stream = stream;
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  // Wait for main stream to complete RoPE
  cudaEventRecord(event, main_stream);
  cudaStreamWaitEvent(alt_stream, event, 0);
  // Copy K on main stream
  k_buffer_ptr.copy_(k_rope, /*non_blocking=*/true);
  // Copy V on alternate stream
  {
    at::cuda::CUDAStreamGuard guard(at::cuda::getStreamFromExternal(alt_stream, device.index()));
    v_buffer_ptr.copy_(v, /*non_blocking=*/true);
  }
  // Main stream waits for alt stream
  cudaEventRecord(event, alt_stream);
  cudaStreamWaitEvent(main_stream, event, 0);
  cudaEventDestroy(event);
} else {
```
```python
# cache_loc = forward_batch.out_cache_loc
# if save_kv_cache and k is not None:
#     forward_batch.token_to_kv_pool.set_kv_buffer(
#         layer, cache_loc, k, v, layer.k_scale, layer.v_scale
#     )
```
```python
## fused from memory_pool set_kv_buffer
"""
if layer_id_override is not None:
    layer_id = layer_id_override
else:
    layer_id = layer.layer_id
if cache_k.dtype != self.dtype:
    if k_scale is not None:
        cache_k.div_(k_scale)
    if v_scale is not None:
        cache_v.div_(v_scale)
    cache_k = cache_k.to(self.dtype)
    cache_v = cache_v.to(self.dtype)

if self.store_dtype != self.dtype:
    cache_k = cache_k.view(self.store_dtype)
    cache_v = cache_v.view(self.store_dtype)

if get_is_capture_mode() and self.alt_stream is not None:
    # Overlap the copy of K and V cache for small batch size
    current_stream = self.device_module.current_stream()
    self.alt_stream.wait_stream(current_stream)
    self.k_buffer[layer_id - self.start_layer][loc] = cache_k
    with self.device_module.stream(self.alt_stream):
        self.v_buffer[layer_id - self.start_layer][loc] = cache_v
    current_stream.wait_stream(self.alt_stream)
else:
    self.k_buffer[layer_id - self.start_layer][loc] = cache_k
    self.v_buffer[layer_id - self.start_layer][loc] = cache_v
"""
```
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
```python
forward_batch.token_to_kv_pool.set_kv_buffer(
    layer, cache_loc, k, v, layer.k_scale, layer.v_scale
)
# cache_loc = forward_batch.out_cache_loc
```
(Don't forget to re-enable this code and add the branching at the correct places.)
```python
cos_sin_cache=self.cos_sin_cache,
is_neox=self.is_neox_style,
layer=layer,
forward_batch=forward_batch,
```
Maybe we should not pass such objects to this API. What about e.g.:

```python
def apply_rope_with_cos_sin_cache_inplace(
    ...,
    # in the non-fused version we do `k_buffer[loc] = data` etc.
    k_buffer: Tensor, v_buffer: Tensor, loc: Tensor,
)
```

If they are None, it means we do not save the KV cache; if non-None, then we need to save it.
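A minimal runnable sketch of this optional-buffer convention (pure Python, with lists as tensor stand-ins; the function name follows the suggestion above, everything else is illustrative):

```python
def apply_rope_with_cos_sin_cache_inplace(q, k, v, k_buffer=None, v_buffer=None, loc=None):
    # RoPE rotation omitted; this sketch only shows the None-vs-provided
    # convention for the cache buffers.
    if k_buffer is not None:
        # Fused path: also write K and V into the cache slots, mirroring the
        # non-fused `k_buffer[loc] = data` assignment.
        for i, slot in enumerate(loc):
            k_buffer[slot] = k[i]
            v_buffer[slot] = v[i]
    return q, k
```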
Confused: why do we have a new file instead of minor modifications?
I am unsure about the best practice for such modifications, e.g., ensuring compatibility. I currently use this style to avoid touching any old files/functions/logic, introducing fully standalone files to decouple. Any suggestion to make it more compatible/extensible for future updates is appreciated.
I personally think the fused part is only a dozen lines and thus can be inlined. Note that you can use C++ templates and `constexpr` if there are overheads.
```python
key: torch.Tensor,
offsets: Optional[torch.Tensor] = None,
layer: Any = None,  # RadixAttention
forward_batch=None,
```
Maybe we should not put `layer`, `forward_batch`, etc. into such a low level. I personally suggest passing `k_buffer_ptr`, `v_buffer_ptr`, etc. down to here. If that is too verbose, maybe make a simple `@dataclass` to pack them.
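As a hedged illustration of that dataclass idea (the class and field names are assumptions for this sketch, not actual sglang types):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class KVCacheWriteArgs:
    """Packs the low-level cache-write arguments so the rope API does not need
    `layer` or `forward_batch`. Hypothetical; names are illustrative."""
    k_buffer: Any                  # destination K cache tensor
    v_buffer: Any                  # destination V cache tensor
    loc: Any                       # slot indices (e.g. out_cache_loc)
    k_scale: Optional[float] = None
    v_scale: Optional[float] = None
```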
@JeremieMelo please fix the conflicts
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Motivation
Fuse `set_kv_buffer` into the sgl-kernel rope function, only for `trtllm_mha` attention.
(The notes below are from @fzyzcjy.)
Speed may be suboptimal (I have not done any ncu profiling or thorough optimization), but it is still faster than the non-fused version.
Accuracy looks good: 20B TP4, reasoning low, gpt_oss.evals gpqa: 55.1%.
Accuracy from @BBuf (he checked this agrees with main).
Speedup: @BBuf measured 345 -> 355.