[Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels #13812

Merged
Fridge003 merged 4 commits into sgl-project:main from Johnsonms:optimized_indexer_kvcache
Dec 3, 2025

Conversation

@Johnsonms Johnsonms commented Nov 23, 2025

Motivation

#13811

Modifications

Implement fused Triton kernels that:

  1. Retrieve both K and S data in a single kernel call.
  2. Implement GetK, GetS, and GetKAndS Triton kernels.
  3. Switch from the torch_fast path to the optimized Triton implementations.
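
For reviewers unfamiliar with the access pattern, here is a minimal NumPy sketch of the fused idea, not the PR's actual Triton code: instead of launching separate gathers for the quantized keys (K) and their dequantization scales (S) out of a paged buffer, one pass resolves each token's physical slot once and copies both. All names (k_buffer, s_buffer, page_table, etc.) are illustrative assumptions, not this PR's identifiers.

```python
import numpy as np

def get_k_and_s_fused(k_buffer, s_buffer, page_table, token_indices):
    """Gather K and S for a set of tokens in one pass.

    k_buffer:      (num_slots, head_dim)  quantized key storage (toy)
    s_buffer:      (num_slots,)           per-token dequant scales (toy)
    page_table:    (num_tokens,)          logical token -> physical slot
    token_indices: tokens to fetch
    """
    slots = page_table[token_indices]        # resolve each slot once
    return k_buffer[slots], s_buffer[slots]  # both gathers reuse the index

# toy usage
k_buffer = np.arange(12, dtype=np.float32).reshape(6, 2)
s_buffer = np.linspace(0.1, 0.6, 6).astype(np.float32)
page_table = np.array([5, 3, 0, 1, 4, 2])
k, s = get_k_and_s_fused(k_buffer, s_buffer, page_table, np.array([0, 2]))
```

In the real kernel the shared slot computation and the two copies happen inside a single Triton program, which is what allows one launch to replace separate GetK and GetS calls.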

Accuracy Tests

Before:
[screenshot]

After:
[screenshot]

Benchmarking and Profiling

  1. Reduced CPU time from 166 µs to 63 µs per layer.
  2. Reduced kernel latency from 63 µs to 9 µs per layer.
  3. Reduced kernel launches from 11 times to 1 time per layer.
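
A toy illustration of where the launch reduction comes from (plain Python, not the PR's profiler or kernels): every separate gather costs one kernel launch, so fetching K and S independently pays at least two launches per layer, while the fused version copies both in one. The counter and function names below are hypothetical.

```python
launches = 0  # stand-in for a CUDA kernel-launch counter

def gather(buffer, slots):
    """Unfused path: one 'launch' per tensor gathered."""
    global launches
    launches += 1
    return [buffer[s] for s in slots]

def gather_fused(k_buf, s_buf, slots):
    """Fused path: one 'launch' copies both K and S."""
    global launches
    launches += 1
    return [k_buf[s] for s in slots], [s_buf[s] for s in slots]

k_buf, s_buf, slots = [10, 20, 30], [0.1, 0.2, 0.3], [2, 0]

k1, s1 = gather(k_buf, slots), gather(s_buf, slots)  # two launches
unfused_launches = launches

launches = 0
k2, s2 = gather_fused(k_buf, s_buf, slots)           # one launch
assert (k1, s1) == (k2, s2)                          # identical data
```

The real 11-to-1 reduction also folds away the index-preparation and reshaping kernels the torch_fast path issued around each gather.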

Before:
[screenshots]

After:
[screenshots]


@gemini-code-assist

Summary of Changes

Hello @Johnsonms, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the performance of the NSA Indexer's K/S buffer access within the SGLang framework. It achieves this by migrating from existing torch_fast implementations to highly optimized, fused Triton kernels. The primary goal is to reduce computational overhead and improve efficiency when retrieving key and scale data, leading to substantial gains in processing speed and resource utilization.

Highlights

  • Performance Optimization: Achieved significant speedups by replacing existing torch_fast implementations with highly optimized Triton kernels for K/S buffer access.
  • Fused K/S Data Retrieval: Introduced a new fused Triton kernel (GetKAndS) that retrieves both Key (K) and Scale (S) data in a single, more efficient operation.
  • Dedicated Triton Kernels: Implemented specific Triton kernels for GetK, GetS, and the new fused GetKAndS operations, enhancing data gathering from paged buffers.
  • Reduced Kernel Launches: The number of kernel launches per layer was drastically reduced from 11 to 1, contributing to overall efficiency.
  • Improved Latency and CPU Time: Benchmarking shows kernel latency dropped from 63 µs to 9 µs, and CPU time per layer decreased from 166 µs to 63 µs.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance optimizations by implementing fused Triton kernels for accessing K/S buffers in the NSA Indexer. The changes are well-structured and effectively reduce kernel launch overhead. My review includes a few minor suggestions to enhance code clarity and maintainability, such as removing commented-out debug statements and refactoring a small part of a Triton kernel to avoid redundant computations.

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 2 times, most recently from c55a480 to 21e94e4 on November 26, 2025 17:37
@Fridge003

Please add correctness tests for the get_k_and_s kernel (these can go under the test/manual folder).

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 3 times, most recently from fd63f10 to c13e826 on November 29, 2025 04:21
@Johnsonms Johnsonms requested a review from Fridge003 on November 29, 2025 04:25
@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 2 times, most recently from f971d49 to 5d84e0f on November 29, 2025 21:52
@Fridge003

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Dec 2, 2025
@Fridge003

@Johnsonms Can you verify the correctness of this PR on AIME/GPQA?
Just make sure it doesn't hit any IMA (illegal memory access) error related to the Triton kernels (https://docs.sglang.io/basic_usage/deepseek_v32.html#benchmarking-results).


Johnsonms commented Dec 2, 2025

FP4 model GPQA result
python3 -m sglang.test.run_eval --port 54321 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3

[screenshot]

AIME 2025 result

#!/bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2-Exp"
MODEL_NAME="dsv32-fp4"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=http://localhost:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++max_concurrent_requests=512 \
  ++server.api_key=dummy \
  ++inference.tokens_to_generate=64000
[screenshot]


Johnsonms commented Dec 2, 2025

FP8 model GPQA result
python3 -m sglang.test.run_eval --port 54321 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
[screenshot]

FP8 AIME 2025 result
[screenshot]

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch from 6d38869 to 3caef92 on December 2, 2025 18:51
@Fridge003 Fridge003 merged commit 043f131 into sgl-project:main Dec 3, 2025
90 of 94 checks passed
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>