Open
Labels
model — Related to model training or definition (not generic infra)
performance — Work related to performance improvements
Description
In the attention mechanism’s forward pass, we compute max(x_q_lens) and max(x_kv_lens) to obtain the max_seqlen arguments required by flash_attn_varlen_func. Because these arguments must be Python ints but the length tensors live on the GPU, the max operation triggers a device-to-host synchronization, which prevents effective use of torch.compile. Since x_q_lens and x_kv_lens are model inputs (not intermediates), the sync should be hoisted as far up the call stack as possible, ideally to where the inputs are first prepared, so it happens once per step rather than once per attention call.
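
A minimal sketch of the change being requested. The helper names `attention_forward` and `prepare_seqlen_metadata` are hypothetical; `flash_attn_varlen_func`, `x_q_lens`, and `x_kv_lens` come from the issue:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Before (per the issue): the max is computed inside every attention forward.
# Converting a CUDA tensor to the Python ints that flash_attn_varlen_func
# expects forces a device-to-host sync on each call.
def attention_forward(q, k, v, cu_seqlens_q, cu_seqlens_k, x_q_lens, x_kv_lens):
    max_seqlen_q = int(x_q_lens.max())   # CUDA sync here
    max_seqlen_k = int(x_kv_lens.max())  # and here
    return flash_attn_varlen_func(
        q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k
    )

# After (proposed): hoist the sync to the top of the call stack, where
# x_q_lens / x_kv_lens first become available, and thread plain ints down.
def prepare_seqlen_metadata(x_q_lens: torch.Tensor, x_kv_lens: torch.Tensor):
    # One sync per step instead of one per attention layer; if the length
    # tensors still live on CPU at this point, there is no sync at all.
    return int(x_q_lens.max()), int(x_kv_lens.max())
```

With the ints computed once up front and passed down, the attention forwards contain no data-dependent host syncs, so torch.compile should be able to capture them without graph breaks.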