
Sync with upstream preserving downstream FA3 code #115

Merged
LucasWilkinson merged 473 commits into main from sync/upstream-main-v2-20260123
Jan 29, 2026
Conversation

@LucasWilkinson
Collaborator

No description provided.

tridao and others added 30 commits August 12, 2025 11:26
This hasn't been used since 2023-09
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size).  In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), and
thus `cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong.  This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
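
A minimal Python sketch of the described fix, with hypothetical names (the real code operates on PyTorch tensors inside the flash-attn frontend):

```python
def expand_cache_seqlen(cache_seqlen, q_shape, k_cache_shape):
    """Expand a scalar cache_seqlen to one entry per batch element."""
    if isinstance(cache_seqlen, int):
        # Buggy version read the batch from k_cache_shape[0], which is
        # num_blocks when block_table / a paged KV cache is in use.
        # Fix: read batch_size from q, whose leading dim is always batch_size.
        batch_size = q_shape[0]
        return [cache_seqlen] * batch_size
    return cache_seqlen

# Paged KV: k_cache is (num_blocks, page_size, nheads, headdim),
# so num_blocks (64) differs from batch_size (2).
seqlens = expand_cache_seqlen(17, q_shape=(2, 128, 8, 64),
                              k_cache_shape=(64, 256, 8, 64))
```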
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it could fall into a deadlock. Initializing the dk and dv semaphores to zero fixes this issue.
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
* [BugFix] fix softcap condition

softcap should only be referenced when it's not None; currently the logic is reversed, which results in an error
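
A plain-Python illustration of the inverted None-check (hypothetical function name; softcapping in FlashAttention applies `softcap * tanh(score / softcap)` to the attention scores):

```python
import math

def apply_softcap(scores, softcap):
    """Tanh softcapping: score -> softcap * tanh(score / softcap).

    The bug was an inverted condition that referenced softcap when it
    was None; the fix only touches softcap when it is provided.
    """
    if softcap is not None:
        return [softcap * math.tanh(s / softcap) for s in scores]
    return scores
```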

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and will result in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
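The fix amounts to replacing truncating division with round-up (ceiling) division — a minimal sketch with hypothetical names:

```python
NUM_THREADS_PER_WARPGROUP = 128  # a warpgroup is 4 warps of 32 threads

def num_consumer_warpgroups(num_consumers: int) -> int:
    # Buggy: num_consumers // NUM_THREADS_PER_WARPGROUP is 0 when
    # num_consumers (e.g. 32) < 128, so the barrier would be initialized
    # with an arrival count of 0, which the compiler rejects.
    # Fixed: round-up division guarantees a minimum value of 1.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP
```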
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=
v0i0 and others added 28 commits January 12, 2026 10:10
)

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing

* remove unnecessary reformatting

* reinstate changes
…ILab#2174)

* update row_max before safe overwrite

* move up row_max_prev
…ab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation
Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.
[Cute,Fwd,Sm100] Add r2p for local mask
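The described XOR trick, in a plain-Python sketch (illustrative only; in the kernel this operates on per-thread register bitmasks):

```python
def dual_bound_mask(col_limit_left: int, col_limit_right: int, nbits: int = 32) -> int:
    """Bitmask with 1s exactly at positions in [col_limit_left, col_limit_right).

    below(r) has 1s at bits [0, r); XOR-ing below(right) with below(left)
    cancels the shared prefix [0, left), leaving only [left, right).
    """
    def below(limit):
        limit = max(0, min(limit, nbits))
        return (1 << limit) - 1
    return below(col_limit_right) ^ below(col_limit_left)

# Window [2, 5): bits 2..4 set -> 0b11100
mask = dual_bound_mask(2, 5)
```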
Conflict resolution strategy:
- hopper/ kernel files: Keep downstream (n_offset, CP, varlen combine)
- hopper/ API files: Take upstream bindings
- csrc/flash_attn_ck/: Take upstream (AMD priority)
- Deleted hopper/flash_api_torch_lib.cpp (upstream has torch bindings)
- Preserved prepare_seqlen_q_ptr instead of num_m_blocks_ptr
- Added attention_chunk field to flash.h (API compat, unused in kernels)
Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.
- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)
Using downstream's hopper code (with n_offset, CP, varlen combine) for full
compatibility. Upstream changes are kept in non-hopper files.
@LucasWilkinson changed the title from "Sync/upstream main v2 20260123" to "Sync with upstream preserving downstream FA3 code" on Jan 23, 2026
@LucasWilkinson LucasWilkinson merged commit 7d346be into main Jan 29, 2026
4 of 5 checks passed
