[Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels #13812

Merged
Fridge003 merged 4 commits into sgl-project:main from Johnsonms:optimized_indexer_kvcache
Dec 3, 2025

Conversation

@Johnsonms Johnsonms commented Nov 23, 2025

Motivation

#13811

Modifications

Implement fused Triton kernels that:

  1. Retrieve both K and S data in a single kernel call.
  2. Implement GetK, GetS, and GetKAndS Triton kernels.
  3. Switch from the torch_fast path to the optimized Triton implementations.
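
For reviewers unfamiliar with the access pattern, here is a minimal NumPy sketch of the fused idea, not the PR's actual Triton code: instead of launching separate gathers for the quantized keys (K) and their dequantization scales (S) out of a paged buffer, one pass resolves each token's physical slot once and copies both. All names (k_buffer, s_buffer, page_table, etc.) are illustrative assumptions, not this PR's identifiers.

```python
import numpy as np

def get_k_and_s_fused(k_buffer, s_buffer, page_table, token_indices):
    """Gather K and S for a set of tokens in one pass.

    k_buffer:      (num_slots, head_dim)  quantized key storage (toy)
    s_buffer:      (num_slots,)           per-token dequant scales (toy)
    page_table:    (num_tokens,)          logical token -> physical slot
    token_indices: tokens to fetch
    """
    slots = page_table[token_indices]        # resolve each slot once
    return k_buffer[slots], s_buffer[slots]  # both gathers reuse the index

# toy usage
k_buffer = np.arange(12, dtype=np.float32).reshape(6, 2)
s_buffer = np.linspace(0.1, 0.6, 6).astype(np.float32)
page_table = np.array([5, 3, 0, 1, 4, 2])
k, s = get_k_and_s_fused(k_buffer, s_buffer, page_table, np.array([0, 2]))
```

In the real kernel the shared slot computation and the two copies happen inside a single Triton program, which is what allows one launch to replace separate GetK and GetS calls.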

Accuracy Tests

Before:
[screenshot]

After:
[screenshot]

Benchmarking and Profiling

  1. Reduced CPU time from 166 µs to 63 µs per layer.
  2. Reduced kernel latency from 63 µs to 9 µs per layer.
  3. Reduced kernel launches from 11 times to 1 time per layer.
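
A toy illustration of where the launch reduction comes from (plain Python, not the PR's profiler or kernels): every separate gather costs one kernel launch, so fetching K and S independently pays at least two launches per layer, while the fused version copies both in one. The counter and function names below are hypothetical.

```python
launches = 0  # stand-in for a CUDA kernel-launch counter

def gather(buffer, slots):
    """Unfused path: one 'launch' per tensor gathered."""
    global launches
    launches += 1
    return [buffer[s] for s in slots]

def gather_fused(k_buf, s_buf, slots):
    """Fused path: one 'launch' copies both K and S."""
    global launches
    launches += 1
    return [k_buf[s] for s in slots], [s_buf[s] for s in slots]

k_buf, s_buf, slots = [10, 20, 30], [0.1, 0.2, 0.3], [2, 0]

k1, s1 = gather(k_buf, slots), gather(s_buf, slots)  # two launches
unfused_launches = launches

launches = 0
k2, s2 = gather_fused(k_buf, s_buf, slots)           # one launch
assert (k1, s1) == (k2, s2)                          # identical data
```

The real 11-to-1 reduction also folds away the index-preparation and reshaping kernels the torch_fast path issued around each gather.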

Before:
[screenshots]

After:
[screenshots]


@gemini-code-assist

Summary of Changes

Hello @Johnsonms, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the performance of the NSA Indexer's K/S buffer access within the SGLang framework. It achieves this by migrating from existing torch_fast implementations to highly optimized, fused Triton kernels. The primary goal is to reduce computational overhead and improve efficiency when retrieving key and scale data, leading to substantial gains in processing speed and resource utilization.

Highlights

  • Performance Optimization: Achieved significant speedups by replacing existing torch_fast implementations with highly optimized Triton kernels for K/S buffer access.
  • Fused K/S Data Retrieval: Introduced a new fused Triton kernel (GetKAndS) that retrieves both Key (K) and Scale (S) data in a single, more efficient operation.
  • Dedicated Triton Kernels: Implemented specific Triton kernels for GetK, GetS, and the new fused GetKAndS operations, enhancing data gathering from paged buffers.
  • Reduced Kernel Launches: The number of kernel launches per layer was drastically reduced from 11 to 1, contributing to overall efficiency.
  • Improved Latency and CPU Time: Benchmarking shows kernel latency dropped from 63 µs to 9 µs, and CPU time per layer decreased from 166 µs to 63 µs.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance optimizations by implementing fused Triton kernels for accessing K/S buffers in the NSA Indexer. The changes are well-structured and effectively reduce kernel launch overhead. My review includes a few minor suggestions to enhance code clarity and maintainability, such as removing commented-out debug statements and refactoring a small part of a Triton kernel to avoid redundant computations.

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 2 times, most recently from c55a480 to 21e94e4 on November 26, 2025 17:37
@Fridge003

Please add correctness tests for the get_k_and_s kernel (these can go under the test/manual folder).

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 3 times, most recently from fd63f10 to c13e826 on November 29, 2025 04:21
@Johnsonms Johnsonms requested a review from Fridge003 on November 29, 2025 04:25
@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch 2 times, most recently from f971d49 to 5d84e0f on November 29, 2025 21:52
@Fridge003

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Dec 2, 2025
@Fridge003

@Johnsonms Can you verify the correctness of this PR on AIME/GPQA?
Just make sure it doesn't hit any IMA (illegal memory access) error related to the Triton kernels (https://docs.sglang.io/basic_usage/deepseek_v32.html#benchmarking-results).


Johnsonms commented Dec 2, 2025

FP4 model GPQA result
python3 -m sglang.test.run_eval --port 54321 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3

[screenshot]

AIME 2025 result

#!/bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2-Exp"
MODEL_NAME="dsv32-fp4"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=http://localhost:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++max_concurrent_requests=512 \
  ++server.api_key=dummy \
  ++inference.tokens_to_generate=64000
[screenshot]


Johnsonms commented Dec 2, 2025

FP8 model GPQA result
python3 -m sglang.test.run_eval --port 54321 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
[screenshot]

FP8 AIME 2025 result
[screenshot]

@Johnsonms Johnsonms force-pushed the optimized_indexer_kvcache branch from 6d38869 to 3caef92 on December 2, 2025 18:51
@Fridge003 Fridge003 merged commit 043f131 into sgl-project:main Dec 3, 2025
90 of 94 checks passed
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
…n Kernels (sgl-project#13812)

Co-authored-by: Johnsonms <johnson@together.ai>