
Sync with upstream preserving downstream FA3 code #115

Merged
LucasWilkinson merged 473 commits into main from sync/upstream-main-v2-20260123
Jan 29, 2026
Conversation

@LucasWilkinson
Collaborator

No description provided.

tridao and others added 30 commits August 12, 2025 11:26
This hasn't been used since 2023-09
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size).  In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), and
thus `cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong.  This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
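
A minimal Python sketch of the described fix, with hypothetical names (the real code operates on PyTorch tensors inside the flash-attn frontend):

```python
def expand_cache_seqlen(cache_seqlen, q_shape, k_cache_shape):
    """Expand a scalar cache_seqlen to one entry per batch element."""
    if isinstance(cache_seqlen, int):
        # Buggy version read the batch from k_cache_shape[0], which is
        # num_blocks when block_table / a paged KV cache is in use.
        # Fix: read batch_size from q, whose leading dim is always batch_size.
        batch_size = q_shape[0]
        return [cache_seqlen] * batch_size
    return cache_seqlen

# Paged KV: k_cache is (num_blocks, page_size, nheads, headdim),
# so num_blocks (64) differs from batch_size (2).
seqlens = expand_cache_seqlen(17, q_shape=(2, 128, 8, 64),
                              k_cache_shape=(64, 256, 8, 64))
```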
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it could fall into a deadlock. Initializing the dk and dv semaphores to zero fixes this issue.
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
* [BugFix] fix softcap condition

softcap should only be referenced when it's not None; currently the logic is reversed, which results in an error
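
A plain-Python illustration of the inverted None-check (hypothetical function name; softcapping in FlashAttention applies `softcap * tanh(score / softcap)` to the attention scores):

```python
import math

def apply_softcap(scores, softcap):
    """Tanh softcapping: score -> softcap * tanh(score / softcap).

    The bug was an inverted condition that referenced softcap when it
    was None; the fix only touches softcap when it is provided.
    """
    if softcap is not None:
        return [softcap * math.tanh(s / softcap) for s in scores]
    return scores
```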

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and will result in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
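The fix amounts to replacing truncating division with round-up (ceiling) division — a minimal sketch with hypothetical names:

```python
NUM_THREADS_PER_WARPGROUP = 128  # a warpgroup is 4 warps of 32 threads

def num_consumer_warpgroups(num_consumers: int) -> int:
    # Buggy: num_consumers // NUM_THREADS_PER_WARPGROUP is 0 when
    # num_consumers (e.g. 32) < 128, so the barrier would be initialized
    # with an arrival count of 0, which the compiler rejects.
    # Fixed: round-up division guarantees a minimum value of 1.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP
```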
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=
v0i0 and others added 28 commits January 12, 2026 10:10
)

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing

* remove unnecessary reformatting

* reinstate changes
…ILab#2174)

* update row_max before safe overwrite

* move up row_max_prev
…ab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation
Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
[Cute,Fwd,Sm100] Add r2p for local mask
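The described XOR trick, in a plain-Python sketch (illustrative only; in the kernel this operates on per-thread register bitmasks):

```python
def dual_bound_mask(col_limit_left: int, col_limit_right: int, nbits: int = 32) -> int:
    """Bitmask with 1s exactly at positions in [col_limit_left, col_limit_right).

    below(r) has 1s at bits [0, r); XOR-ing below(right) with below(left)
    cancels the shared prefix [0, left), leaving only [left, right).
    """
    def below(limit):
        limit = max(0, min(limit, nbits))
        return (1 << limit) - 1
    return below(col_limit_right) ^ below(col_limit_left)

# Window [2, 5): bits 2..4 set -> 0b11100
mask = dual_bound_mask(2, 5)
```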
Conflict resolution strategy:
- hopper/ kernel files: Keep downstream (n_offset, CP, varlen combine)
- hopper/ API files: Take upstream bindings
- csrc/flash_attn_ck/: Take upstream (AMD priority)
- Deleted hopper/flash_api_torch_lib.cpp (upstream has torch bindings)
- Preserved prepare_seqlen_q_ptr instead of num_m_blocks_ptr
- Added attention_chunk field to flash.h (API compat, unused in kernels)
Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.
- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)
Using downstream's hopper code (with n_offset, CP, varlen combine) for full
compatibility. Upstream changes are kept in non-hopper files.
@LucasWilkinson changed the title from "Sync/upstream main v2 20260123" to "Sync with upstream preserving downstream FA3 code" on Jan 23, 2026
@LucasWilkinson LucasWilkinson merged commit 7d346be into main Jan 29, 2026
4 of 5 checks passed
