[Kernel] Add JIT apply_rope_with_cos_sin_cache_inplace #18155
BBuf merged 8 commits into sgl-project:main
Conversation
/tag-and-rerun-ci
size_t k_rope_stride_h = k_rope.stride(1);

auto query_dtype = q.dtype();
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
@DarkSharpness Should we avoid using torch ATen in jit_kernel? How can we replace at::cuda::getCurrentCUDAStream() in tvm-ffi?

Can we use const cudaStream_t stream = LaunchKernel::resolve_device(device); to replace the torch ATen CUDA stream?
/rerun-failed-ci
python/sglang/jit_kernel/rope.py (outdated)
@cache_once
def _jit_apply_rope_pos_ids_cos_sin_cache_module() -> Module:
    import flashinfer
Can we move this import to the top of the module?
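A minimal sketch of what the suggested change could look like, assuming flashinfer can safely be imported at module load time (if it is an optional dependency, the lazy in-function import may be intentional); the helper name below is hypothetical:

```python
import pathlib

import flashinfer  # hoisted from inside the cached builder, per the review suggestion


def _flashinfer_include_dir() -> pathlib.Path:
    """Hypothetical helper: locate FlashInfer's bundled headers for JIT compilation."""
    include_dir = pathlib.Path(flashinfer.__file__).parent.resolve() / "data" / "include"
    assert include_dir.exists(), f"FlashInfer headers not found at {include_dir}"
    return include_dir
```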
python/sglang/jit_kernel/rope.py (outdated)
import flashinfer

flashinfer_dir = pathlib.Path(flashinfer.__file__).parent.resolve()
assert (flashinfer_dir / "data" / "include").exists()
python/sglang/jit_kernel/rope.py (outdated)
# and torch.custom_op cannot express optional mutates_args reliably
@custom_op(
    "sgl_jit_kernel::apply_rope_pos_ids_cos_sin_cache_save_kv_cache",
    mutates_args="unknown",
One small question: do we really need to upstream all the kernels from flashinfer? In most cases, I guess we can directly use flashinfer kernels if our JIT kernels don't have a performance advantage.
It seems the goal is to build a fused RoPE + KV-cache-save kernel, which is why the BatchQKApplyRotaryPosIdsCosSinCacheEnhanced class is implemented. If KV-cache saving isn't needed, BatchQKApplyRotaryPosIdsCosSinCache should be able to call FlashInfer directly from Python.
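For that no-KV-cache-save path, here is a hedged sketch of calling FlashInfer directly from Python; the shapes are illustrative and the call assumes FlashInfer's apply_rope_with_cos_sin_cache_inplace(positions, query, key, head_size, cos_sin_cache, is_neox) signature:

```python
import torch
from flashinfer.rope import apply_rope_with_cos_sin_cache_inplace

num_tokens, num_qo_heads, num_kv_heads, head_size = 16, 8, 2, 128

q = torch.randn(num_tokens, num_qo_heads * head_size, dtype=torch.float16, device="cuda")
k = torch.randn(num_tokens, num_kv_heads * head_size, dtype=torch.float16, device="cuda")
positions = torch.arange(num_tokens, dtype=torch.int64, device="cuda")
# cos_sin_cache: [max_position, head_size], cos in the first half, sin in the second.
cos_sin_cache = torch.randn(4096, head_size, dtype=torch.float32, device="cuda")

# Rotates q and k in place; no KV-cache write, so no fused kernel is needed here.
apply_rope_with_cos_sin_cache_inplace(positions, q, k, head_size, cos_sin_cache, is_neox=True)
```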
/rerun-failed-ci
@DarkSharpness Totally agree with you. Honestly speaking, I observed some perf regression for the JIT kernel, which introduces unnecessary compiling. It might not be implemented that way by design, but I'd say that if a JIT kernel is not applied well, it introduces more overhead than benefit. Take the following case for example:
(profiling screenshot)
Motivation
Add JIT-compiled CUDA kernels for apply_rope_with_cos_sin_cache_inplace
Modifications
Accuracy Tests
Verified against apply_rope_with_cos_sin_cache_inplace from sgl_kernel
Check python/sglang/jit_kernel/tests/test_rope.py
Run python -m pytest python/sglang/jit_kernel/tests/test_rope.py -s
Accuracy test on gsm8k after applying the JIT apply_rope_with_cos_sin_cache_inplace to srt
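A hypothetical sketch in the spirit of the accuracy test above (not the actual test file); the import paths and call signatures are assumptions and may differ from the PR:

```python
import torch


def test_jit_rope_matches_sgl_kernel():
    # Both import paths are assumed, for illustration only.
    from sgl_kernel import apply_rope_with_cos_sin_cache_inplace as ref_op
    from sglang.jit_kernel.rope import apply_rope_with_cos_sin_cache_inplace as jit_op

    torch.manual_seed(0)
    num_tokens, head_size = 32, 128
    q_ref = torch.randn(num_tokens, 8 * head_size, dtype=torch.float16, device="cuda")
    k_ref = torch.randn(num_tokens, 2 * head_size, dtype=torch.float16, device="cuda")
    q_jit, k_jit = q_ref.clone(), k_ref.clone()
    positions = torch.arange(num_tokens, dtype=torch.int64, device="cuda")
    cos_sin_cache = torch.randn(4096, head_size, dtype=torch.float32, device="cuda")

    # Run the reference kernel and the JIT kernel on identical inputs, then compare.
    ref_op(positions, q_ref, k_ref, head_size, cos_sin_cache, is_neox=True)
    jit_op(positions, q_jit, k_jit, head_size, cos_sin_cache, is_neox=True)

    torch.testing.assert_close(q_jit, q_ref, rtol=1e-3, atol=1e-3)
    torch.testing.assert_close(k_jit, k_ref, rtol=1e-3, atol=1e-3)
```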
Benchmarking and Profiling
A10
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci