Support overlap-spec-v2 with fa3 attention backend #12128

b8zhong wants to merge 4 commits into sgl-project:main
Conversation
Code Review
This pull request adds support for overlap-spec-v2 with the fa3 attention backend by introducing a new DRAFT_EXTEND_V2 forward mode. The changes correctly handle this new mode in most metadata initialization functions. However, I've identified a potential issue in the CUDA graph replay logic, where the new handling for DRAFT_EXTEND_V2 is incorrectly applied to the existing DRAFT_EXTEND mode. This could lead to incorrect behavior if that code path is used. My review includes a detailed explanation and a suggested fix for this issue.
- elif forward_mode.is_draft_extend():
+ elif forward_mode.is_draft_extend(include_v2=True):
While this change correctly includes DRAFT_EXTEND_V2, the logic within this elif block seems to incorrectly handle the original DRAFT_EXTEND mode.
The block now updates max_seq_len_q and cu_seqlens_q based on spec_info.accept_length (lines 1957-1965), which is specific to DRAFT_EXTEND_V2. For the original DRAFT_EXTEND mode, these parameters are static and should not be updated during replay. This could lead to incorrect behavior if DRAFT_EXTEND mode is used with CUDA graphs.
I suggest wrapping the logic that uses accept_length in a condition that checks for DRAFT_EXTEND_V2 mode specifically. Here is a suggested implementation for the whole block:
elif forward_mode.is_draft_extend(include_v2=True):
    metadata = self.draft_extend_metadata[bs]
    # K-side metadata is updated the same way for both draft-extend modes.
    metadata.cache_seqlens_int32.copy_(seq_lens)
    metadata.max_seq_len_k = seq_lens_cpu.max().item()
    metadata.cu_seqlens_k[1:].copy_(
        torch.cumsum(metadata.cache_seqlens_int32, dim=0, dtype=torch.int32)
    )
    if forward_mode.is_draft_extend_v2():
        # Only DRAFT_EXTEND_V2 carries per-request accept lengths, so only
        # it should update the query-side metadata during CUDA graph replay.
        accept_length = spec_info.accept_length[:bs]
        if spec_info.accept_length_cpu:
            metadata.max_seq_len_q = max(spec_info.accept_length_cpu) + 1
        else:
            metadata.max_seq_len_q = 1
        metadata.cu_seqlens_q[1:].copy_(
            torch.cumsum(accept_length, dim=0, dtype=torch.int32)
        )
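For readers following along, here is a minimal sketch of the ForwardMode helpers the diff above relies on. This is a hypothetical reconstruction for illustration; the actual enum in sglang has more members and the real definitions may differ.

from enum import IntEnum, auto

class ForwardMode(IntEnum):
    EXTEND = auto()
    DECODE = auto()
    DRAFT_EXTEND = auto()
    DRAFT_EXTEND_V2 = auto()  # new mode introduced for overlap-spec-v2

    def is_draft_extend(self, include_v2: bool = False) -> bool:
        # include_v2=True matches both draft-extend variants, which is why
        # the shared K-side metadata update above runs for both modes.
        if include_v2:
            return self in (ForwardMode.DRAFT_EXTEND, ForwardMode.DRAFT_EXTEND_V2)
        return self is ForwardMode.DRAFT_EXTEND

    def is_draft_extend_v2(self) -> bool:
        # Only the V2 mode has per-request accept_length, so only it should
        # trigger the cu_seqlens_q / max_seq_len_q update during replay.
        return self is ForwardMode.DRAFT_EXTEND_V2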
Acc seems to have dropped, check CI.

btw, check #12113

Could you share the profile to make sure it's overlapped correctly?
Motivation
Enable overlap-spec-v2 on FA3 MLA, which is the default attention backend for DeepSeek (dpsk) models on Hopper.
Modifications
The speed of python3 -m sglang.test.send_one goes from … to …

So this is around a 13% improvement. However, there appear to be some slow kernels, which I will address separately.
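For reference, a minimal sketch of how one might reproduce this benchmark. The spec-v2 toggle name and the model paths are assumptions, not taken from this PR; check the repository for the actual flag or environment variable enabling overlap-spec-v2.

# Launch sglang with the FA3 attention backend and EAGLE speculative decoding.
# SGLANG_ENABLE_SPEC_V2=1 is a hypothetical toggle for overlap-spec-v2; the
# model paths below are placeholders.
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
    --model-path <target-model> \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path <draft-model> \
    --attention-backend fa3

# In another shell, measure single-request speed:
python3 -m sglang.test.send_one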