Fix vanilla and torch attention cu_seqlens handling #310

Open
Mr-Neutr0n wants to merge 1 commit into Tencent-Hunyuan:main from Mr-Neutr0n:fix/attention-cu-seqlens-handling
Conversation

@Mr-Neutr0n

Summary

  • Vanilla attention (mode="vanilla") completely ignores cu_seqlens_q/cu_seqlens_kv when provided, allowing attention to bleed across segment boundaries (valid tokens attend to padding tokens and vice versa). This causes the quality degradation reported in [BUG] attention quality is much worse with the vanilla version #296. Fixed by building a block-diagonal attention mask from cu_seqlens that prevents cross-segment attention.
  • Torch attention (mode="torch") hardcodes cu_seqlens_q[1] as a single split point, which only works for batch_size=1. For batch_size > 1, the remaining segment boundaries in cu_seqlens are silently ignored. Fixed by iterating over all batch items using the full cu_seqlens array.

Details

flash_attn_varlen_func in the flash mode correctly handles variable-length sequences via cu_seqlens, separating valid (image + text) tokens from padding tokens. The vanilla and torch code paths did not replicate this behavior:
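As a point of reference for the discussion below, a minimal sketch of how such a cu_seqlens array could be laid out (the variable names and the two-samples-padded-to-8 example are illustrative assumptions, not code from this PR):

```python
import torch

# Hypothetical example: batch_size=2 with valid lengths [5, 3], each
# sample padded to max_len=8. Alternating valid/padding segments per
# batch item give 2 * batch_size + 1 cumulative boundaries.
valid_lens = torch.tensor([5, 3])
max_len = 8
segments = torch.stack([valid_lens, max_len - valid_lens], dim=1).flatten()
cu_seqlens = torch.cat(
    [torch.zeros(1, dtype=torch.long), segments.cumsum(0)]
)
print(cu_seqlens)  # tensor([ 0,  5,  8, 11, 16])
```

Here entries [0, 5) and [5, 8) are the valid/padding segments of sample 0, and [8, 11) and [11, 16) those of sample 1.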

Vanilla mode computed attention over the entire concatenated sequence with no masking at all. When cu_seqlens encodes a valid segment [0, s) and a padding segment [s, max_len), all tokens could freely attend to each other, corrupting the output.
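The masking fix can be sketched roughly as follows (a hedged illustration of the approach, not the exact code from the PR; `block_diag_mask` is a hypothetical helper name):

```python
import torch

def block_diag_mask(cu_seqlens, seq_len):
    # Hypothetical sketch: True exactly where query and key fall inside
    # the same cu_seqlens segment, so valid tokens never attend to
    # padding tokens and vice versa.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        mask[start:end, start:end] = True
    return mask

# valid segment [0, 5) and padding segment [5, 8)
mask = block_diag_mask(torch.tensor([0, 5, 8]), 8)
```

Passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention` (where True means "may attend"), this block-diagonal mask confines every query to its own segment.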

Torch mode used cu_seqlens_q[1] to split into exactly two segments. The cu_seqlens array has 2 * batch_size + 1 entries (a valid/padding pair per batch item), so the two-segment split is only correct when batch_size == 1.

Test plan

  • Verify vanilla attention output matches flash attention output (within numerical tolerance) for batch_size=1 with padding tokens
  • Verify torch attention output matches flash attention output for batch_size > 1
  • Confirm generation quality with mode="vanilla" is comparable to mode="flash" (addresses [BUG] attention quality is much worse with the vanilla version #296)

The vanilla attention mode completely ignored cu_seqlens_q/cu_seqlens_kv
when they were provided, computing attention over the entire concatenated
sequence including padding tokens. This caused valid image/text tokens to
attend to padding tokens and vice versa, leading to severe quality
degradation compared to flash attention (see Tencent-Hunyuan#296).

The torch attention mode hardcoded cu_seqlens_q[1] as a single split
point, which only worked correctly for batch_size=1. For batch_size > 1
the cu_seqlens array contains multiple segment boundaries that were
silently ignored, producing incorrect attention outputs.

Changes:
- vanilla mode: build a block-diagonal attention mask from cu_seqlens
  that prevents cross-segment attention between valid and padding tokens
- torch mode: iterate over all batch items using the full cu_seqlens
  array to correctly split valid/padding segments per sample
@tencent-adm

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@Mr-Neutr0n
Author

I have read the CLA Document and I hereby sign the CLA
