[2/2] Support MHA prefill with FlashAttention 4. by lifuhuang · Pull Request #10937 · sgl-project/sglang

lifuhuang · 2025-09-26T03:25:56Z

(co-author @hyhieu)

Update: the kernel change has been checked in separately in #10940.

Motivation

Add support for FA4 MHA prefill. The changes are mostly based on #9428.

Modifications

Accuracy Tests

lm_eval \
  --model local-chat-completions \
  --model_args model=gpt-oss,base_url=http://127.0.0.1:30010/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks gsm8k \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 8

FA4:

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8931 | ± | 0.0085 |
| | | strict-match | 8 | exact_match | ↑ | 0.3419 | ± | 0.0131 |

Baseline (TRTLLM-MHA):

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8840 | ± | 0.0088 |
| | | strict-match | 8 | exact_match | ↑ | 0.3207 | ± | 0.0129 |
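One way to read the two result tables is to check whether the FA4 and TRTLLM-MHA scores differ by more than their reported standard errors. The sketch below is only a rough sanity check, not part of the PR: it treats the two runs as independent (an assumption, since both use the same gsm8k test set) and computes a simple z-score from the values and stderrs listed above. The function name `z_score` is made up for this example.

```python
import math


def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error.

    Assumes the two measurements are independent, which is a simplification here.
    """
    combined_stderr = math.sqrt(stderr_a**2 + stderr_b**2)
    return (value_a - value_b) / combined_stderr


# Values copied from the FA4 and TRTLLM-MHA tables in this PR description.
flexible = z_score(0.8931, 0.0085, 0.8840, 0.0088)
strict = z_score(0.3419, 0.0131, 0.3207, 0.0129)

print(f"flexible-extract z = {flexible:.2f}")
print(f"strict-match     z = {strict:.2f}")
```

Under these assumptions both z-scores come out below the usual 1.96 threshold, so the accuracy gap between the two backends is within what the reported stderrs would explain.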

Benchmarking and Profiling

Benchmarked model: openai/gpt-oss-20b. I am not seeing a significant difference between fa4 and trtllm-mha; both are significantly faster than triton, as expected.

FA4:

TRTLLM-MHA

Triton

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

gemini-code-assist · 2025-09-26T03:26:09Z

Summary of Changes

Hello @lifuhuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's ability to leverage FlashAttention v4 for prefill operations, aiming to improve performance and efficiency. It achieves this by removing previous hardcoded limitations on fa4 usage and by refactoring the attention backend selection process into a dedicated utility, making the system more robust and easier to configure for different attention mechanisms.

Highlights

FlashAttention v4 Support: FlashAttention v4 (fa4) is now fully supported for Multi-Head Attention (MHA) prefill operations, removing previous restrictions that limited its use to specific model architectures or prevented its application during prefill.
Backend Selection Refactoring: The logic for determining the appropriate prefill and decode attention backends has been centralized into a new utility function, determine_attention_backends, improving code modularity and maintainability.
Expanded Backend Options: The list of supported attention backends for GptOssForCausalLM models has been updated to include 'fa4', allowing for more flexible and potentially faster attention computations.
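The backend-selection refactoring mentioned in the highlights can be pictured roughly as follows. This is a minimal sketch, not the code from the PR: the `BackendArgs` dataclass and the `resolve_attention_backends` function are hypothetical stand-ins, and the fallback rule (a phase-specific setting overrides a shared default) is an assumption about how such a helper would typically behave.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BackendArgs:
    """Hypothetical stand-in for the attention-related server arguments."""

    attention_backend: Optional[str] = None          # shared default
    prefill_attention_backend: Optional[str] = None  # prefill-only override
    decode_attention_backend: Optional[str] = None   # decode-only override


def resolve_attention_backends(args: BackendArgs) -> Tuple[Optional[str], Optional[str]]:
    """Return (prefill_backend, decode_backend).

    Assumed rule: a phase-specific override wins, otherwise fall back to
    the shared default.
    """
    prefill = args.prefill_attention_backend or args.attention_backend
    decode = args.decode_attention_backend or args.attention_backend
    return prefill, decode


# Example: FA4 for prefill only, with the shared default used for decode.
example = BackendArgs(attention_backend="triton", prefill_attention_backend="fa4")
print(resolve_attention_backends(example))
```

Keeping this resolution in one function means callers do not each have to reimplement the override logic.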

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds support for MHA prefill with FlashAttention 4 by removing assertions that previously restricted its use. The changes also include a refactoring of the attention backend determination logic into a new utility function.

My main feedback is to further improve the refactoring by moving the new determine_attention_backends function into the ServerArgs class as a method. This would improve code structure by eliminating a dependency from the low-level server_args.py module to the large utils.py module, which is better for maintainability. I've left specific suggestions on how to achieve this.

python/sglang/srt/model_executor/model_runner.py

python/sglang/srt/server_args.py

python/sglang/srt/utils.py

Co-authored-by: Hieu Pham <hyhieu@gmail.com>

lifuhuang · 2025-09-26T04:17:48Z

sgl-kernel/python/sgl_kernel/flash_attn.py

-        raise NotImplementedError("haven't implemented flash_attn_with_kvcache for fa4")
+        assert (
+            flash_attn_varlen_func_v4 is not None
+        ), "FA4 is not available, please check your installation."
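The assertion in the diff above relies on `flash_attn_varlen_func_v4` being `None` when FA4 could not be imported. A minimal sketch of that optional-import pattern is shown below. The module name `hypothetical_fa4_pkg` and the helper `require_fa4` are invented for illustration and are not the actual sgl-kernel layout.

```python
try:
    # Hypothetical module path; stands in for wherever the FA4 kernel lives.
    from hypothetical_fa4_pkg import (
        flash_attn_varlen_func as flash_attn_varlen_func_v4,
    )
except ImportError:
    # Leave a sentinel so callers can fail with a clear message at call time.
    flash_attn_varlen_func_v4 = None


def require_fa4():
    """Return the FA4 varlen function, or fail with a readable message."""
    assert (
        flash_attn_varlen_func_v4 is not None
    ), "FA4 is not available, please check your installation."
    return flash_attn_varlen_func_v4
```

Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `raise` may be preferable if the check must always run.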


TODO: Need to check in the kernel change first and bump sgl-kernel.

Waiting for #10940 to merge first.

Swipe4057 · 2025-09-27T05:55:58Z

The H100-H200 series is not supported by FA4, right?

cicirori · 2025-10-01T20:01:20Z

> The H100-H200 series is not supported by FA4, right?

It was restricted to sm100 because we have only tested this on Blackwell, which is the primary optimization target.
FA4 does support sm90 when page_table is None.
It would be nice if you could tell us whether we can simply enable sm90 in MHA prefill.
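The constraint described in this comment can be written down as a small predicate. This is only a sketch of the rule as stated here (sm100 tested, sm90 usable when `page_table` is None), not the check used in sglang; the function name `fa4_prefill_supported` is hypothetical, and whether sm90 can actually be enabled for MHA prefill is left open in the thread.

```python
from typing import Optional, Sequence


def fa4_prefill_supported(
    compute_capability_major: int,
    page_table: Optional[Sequence[int]],
) -> bool:
    """Encode the support rule described in the comment above (an assumption).

    - sm100 (Blackwell, major == 10): the tested configuration.
    - sm90 (Hopper, major == 9): only when no page table is used.
    - anything else: treated as unsupported in this sketch.
    """
    if compute_capability_major == 10:
        return True
    if compute_capability_major == 9:
        return page_table is None
    return False


# In a real setup the major version could come from
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
print(fa4_prefill_supported(9, None))
print(fa4_prefill_supported(9, [0, 1, 2]))
```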

zhyncs

@lifuhuang can you upgrade sgl-kernel to v0.3.15 in this PR?

lifuhuang · 2025-10-08T04:16:43Z

> @lifuhuang can u upgrade sgl-kernel v0.3.15 in this pr

done

lifuhuang requested review from Ying1123, hnyls2002, ispobock, merrymercy, ping1jing2 and zhyncs as code owners September 26, 2025 03:25

sglang-bot added the run-ci label Sep 26, 2025

FlamingoPg force-pushed the lifu/b200 branch from ee191ca to 666024a Compare September 26, 2025 03:26

FlamingoPg requested review from BBuf, FlamingoPg, HaiShaw and yizhang2077 as code owners September 26, 2025 03:26

gemini-code-assist bot reviewed Sep 26, 2025

Support MHA prefill with FlashAttention 4.

0ac0424

Co-authored-by: Hieu Pham <hyhieu@gmail.com>

FlamingoPg force-pushed the lifu/b200 branch from 666024a to 0ac0424 Compare September 26, 2025 03:34

Minor.

433a32c

lifuhuang commented Sep 26, 2025

zhyncs approved these changes Sep 26, 2025

zhyncs self-assigned this Sep 26, 2025

zhyncs added the high priority label Sep 26, 2025

Merge branch 'main' into lifu/b200

befeb05

lifuhuang mentioned this pull request Sep 26, 2025

[1/2] Support FA4 for MHA Prefill in sgl-kernel #10940

Merged

lifuhuang added the DO NOT MERGE label Sep 26, 2025

lifuhuang changed the title ~~Support MHA prefill with FlashAttention 4.~~ [2/2] Support MHA prefill with FlashAttention 4. Sep 26, 2025

Merge remote-tracking branch 'origin/main' into lifu/b200

19da3a6

lifuhuang requested review from Edwardf0t1, ch-wan and kushanam as code owners September 30, 2025 08:23

lifuhuang removed the DO NOT MERGE label Sep 30, 2025

zhyncs requested changes Oct 7, 2025

Bump sgl-kernel to 0.3.15

c3f4db6

lifuhuang force-pushed the lifu/b200 branch from 011e593 to c3f4db6 Compare October 8, 2025 04:12

Merge remote-tracking branch 'origin/main' into lifu/b200

b1444a5

lifuhuang requested a review from zhyncs October 8, 2025 04:16

upd

735c4b7

zhyncs requested review from CatherineSue, JustinTong0323 and slin1237 as code owners October 8, 2025 07:54

zhyncs approved these changes Oct 8, 2025

zhyncs merged commit edefab0 into main Oct 8, 2025
6 of 40 checks passed

zhyncs deleted the lifu/b200 branch October 8, 2025 07:54

zhyncs mentioned this pull request Oct 8, 2025

FA4 #9428

Closed

4 tasks

ch-tiger1 pushed a commit to ch-tiger1/sglang that referenced this pull request Oct 9, 2025

[2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)

329a695

Co-authored-by: Hieu Pham <hyhieu@gmail.com>

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025

[2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)

7e413f9

Co-authored-by: Hieu Pham <hyhieu@gmail.com>

zhyncs merged 7 commits into main from lifu/b200

5 participants
