Update extend/decode attention kernel for CPU in sgl-kernel and add UTs by yanbing-j · Pull Request #6405 · sgl-project/sglang

yanbing-j · 2025-05-19T03:05:04Z

Motivation

This PR follows up on #2807 and #5150 to update the extend/decode attention kernels for CPU. We fuse set_kv_buffer into decode attention to reduce overhead. We also add the corresponding UTs, test_extend.py and test_decode.py, for the CPU extend/decode attention kernels.
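To illustrate what fusing set_kv_buffer into decode attention means, here is a minimal pure-Python reference sketch for a single request. It is not the C++ kernel from this PR: the function name `decode_attention_fused`, the argument names, and the nested-list cache layout are all assumptions made for illustration. The sketch writes the new token's key/value into the cache and then attends over the request's cached tokens in the same call, with query heads mapped to KV heads by integer division so that an odd `num_heads / num_heads_kv` ratio is also covered.

```python
import math


def decode_attention_fused(q, k_new, v_new, k_buffer, v_buffer, loc, token_locs, scale):
    """Hypothetical reference for decode attention with the KV write fused in.

    q:                 [num_heads][head_dim] query of the new token
    k_new, v_new:      [num_kv_heads][head_dim] key/value of the new token
    k_buffer/v_buffer: [num_slots][num_kv_heads][head_dim] cache, mutated in place
    loc:               cache slot that receives the new token
    token_locs:        cache slots of all tokens of this request, including loc
    scale:             softmax scaling factor, typically 1/sqrt(head_dim)
    """
    # Fused step: store the new key/value in the cache instead of relying on
    # a separate set_kv_buffer call before attention.
    k_buffer[loc] = [list(head) for head in k_new]
    v_buffer[loc] = [list(head) for head in v_new]

    num_heads = len(q)
    num_kv_heads = len(k_new)
    group = num_heads // num_kv_heads  # query heads per KV head, may be odd

    out = []
    for h in range(num_heads):
        h_kv = h // group  # KV head shared by this query head
        # Scaled dot-product scores against every cached token of the request.
        scores = [
            scale * sum(a * b for a, b in zip(q[h], k_buffer[t][h_kv]))
            for t in token_locs
        ]
        # Numerically stable softmax.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        head_dim = len(v_buffer[token_locs[0]][h_kv])
        out.append([
            sum(w * v_buffer[t][h_kv][d] for w, t in zip(weights, token_locs)) / denom
            for d in range(head_dim)
        ])
    return out
```

A unit test for a real kernel would typically compare the kernel output and the updated cache against a reference of this kind; the exact tolerances and tensor layouts used in test_decode.py are not shown here.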

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.


yanbing-j force-pushed the yanbing/attention_kernel branch 2 times, most recently from 4666cf8 to 816b4f7 on May 19, 2025 08:22

yanbing-j and others added 7 commits May 19, 2025 08:26

k/v supports non-contiguous tensors, update extend_attention calling …

816b4f7

…and UTs

Fuse decode attention with kv buffer (sgl-project#31)

c9993c3

* Fuse decode attention with kv buffer
* Add vec of copy
* Update MHA part
* update
* Move output outside of kernel
* update

Support num_prompts > 1 in serving mode (sgl-project#39)

0d17660

* Support num_prompts > 1 in serving mode
* Update

decode_attention: fix bug when num_heads / num_heads_kv is an odd number

16f25a7

Remove unnecessary key contiguous check (sgl-project#48)

a2d6c99

Update loc to int64_t

77ffc37

Add test_extend and test_decode UTs

25b9c2a

mingfeima approved these changes May 20, 2025


mingfeima marked this pull request as ready for review May 20, 2025 01:51

mingfeima requested review from BBuf, FlamingoPg, HandH1998, Ying1123, ispobock, merrymercy, yizhang2077 and zhyncs as code owners May 20, 2025 01:51

Merge branch 'main' into yanbing/attention_kernel

0f37bbd

This was referenced May 20, 2025

[Feature] RFC for adding CPU support for SGLang #2807

Closed

Add intel_amx backend for Radix Attention #6143

Closed

Merge branch 'main' into yanbing/attention_kernel

1554c9e

zhyncs merged commit 32cc66e into sgl-project:main May 20, 2025

mingfeima added sgl-kernel intel cpu cpu backend performance optimization labels May 21, 2025

Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025

Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (sgl-project#6405)

8a5c253

Co-authored-by: mingfeima <mingfei.ma@intel.com>

xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025

Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (sgl-project#6405)

1582bfb

Co-authored-by: mingfeima <mingfei.ma@intel.com>

zhyncs merged 9 commits into sgl-project:main from yanbing-j:yanbing/attention_kernel


3 participants
