Sync with upstream preserving downstream FA3 code #115
Merged
LucasWilkinson merged 473 commits into main (Jan 29, 2026)
Conversation
This hasn't been used since 2023-09
This fixes Commit 81cdf4c
Credit: Jay Shah's idea
…Lab#1795) When the parameter `cache_seqlen` is a scalar, it should be expanded to a vector of shape (batch_size). In the original code, whenever `block_table` is used, `k_cache` has shape (num_blocks, page_size, ...), so `cache_seqlen` was expanded to shape (num_blocks) instead of (batch_size), which is wrong. This fix uses the shape of `q`, whose leading dimension is always `batch_size`.
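A minimal sketch of the corrected expansion, assuming NumPy-like shapes (the real code operates on PyTorch tensors; `expand_cache_seqlen` is a hypothetical helper, not the library's API):

```python
import numpy as np

def expand_cache_seqlen(cache_seqlen, q):
    # Broadcast a scalar cache_seqlen to one entry per batch element.
    # q always has shape (batch_size, ...), whereas with a block_table
    # k_cache has shape (num_blocks, page_size, ...), so q.shape[0] is
    # the correct expansion size, not k_cache.shape[0].
    if np.isscalar(cache_seqlen):
        cache_seqlen = np.full(q.shape[0], cache_seqlen, dtype=np.int32)
    return cache_seqlen

q = np.zeros((4, 8, 16))  # batch_size = 4
print(expand_cache_seqlen(7, q))  # → [7 7 7 7]
```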
Actually doesn't seem to make it faster
* use LPT order in varlen kernel
* add prefill decode benchmark script
* add sort in prepare
* add full implementation
* add varlen kvhead swizzle
* add settings for swizzle ablation
* add correction term for sort when causal
* remove ablation options from frontend and clean up comments
* add comments in prepare kernel
* remove debug code and scripts
* put back defaults in tests
* remove excess Nones returned in python interface for varlen
* revert opinionated change to setup.py on cuda version 12.9
* force inline sort op and make east const
* more templating in varlen scheduler to cure some register spilling
* fix exploding build by splitting compilation and add qol macros for hdimdiff
* fix metadata mismatch with seqlenk in test script
* extend prepare kernel to >992 batches and always call it for varlen
* do inter-batch sort per every 992 batches
* better names in combine and fix prepare condition in api
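The LPT (Longest-Processing-Time-first) ordering mentioned above can be sketched as a simple descending sort of batch indices by sequence length, so the largest tiles of work are scheduled first and CTAs finish at roughly the same time. This is an illustrative sketch only; the actual kernel sorts inside a prepare kernel on device:

```python
import numpy as np

def lpt_order(seqlens):
    # LPT scheduling heuristic: dispatch the longest sequences first so
    # no single long sequence is left running alone at the tail.
    return np.argsort(-np.asarray(seqlens), kind="stable")

print(lpt_order([5, 100, 20, 7]))  # → [1 2 3 0]
```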
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it fell into a deadlock. Initializing the dk and dv semaphores to zero fixes this issue.
* ci: Move build job to workflow template
* check out right tag
* fix
* fix
* fix
* fix
* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Move build job to workflow template
* check out right tag
* fix
* fix
* fix
* revert
* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)
  * add
  * cleanui
  * cxx11_abi
  * fix
  * fix
  * test
  * fix
  * fix
  * final
* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output
* style
* style
* revert test changes, introduce optional kwarg to output lse
* [BugFix] fix softcap condition: softcap should only be referenced when it is not None; currently the logic is reversed and results in an error
* [BugFix] fix sm80 CuteDSL error:
  1. The current condition on softcap is wrong and results in a RuntimeError; change the code to align with sm_100.
  2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.
* Fix typo of range_constexpr
* Fix seqlen
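The softcap bug above is an inverted None-check. A minimal sketch of the corrected guard (hypothetical `apply_softcap` helper, assuming the usual tanh-softcapping formula `softcap * tanh(s / softcap)`):

```python
import math

def apply_softcap(scores, softcap=None):
    # softcap must only be dereferenced when it is not None; the buggy
    # version inverted this check, so the None path tried to divide by
    # softcap and raised.
    if softcap is not None:
        scores = [softcap * math.tanh(s / softcap) for s in scores]
    return scores

print(apply_softcap([1.0], None))  # → [1.0] (no-op when disabled)
```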
…e DSL (Dao-AILab#1858)
* update num_threads based on num wgs
* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup. Previously, truncating integer division made num_consumer_warpgroups_per_cluster 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to guarantee a minimum value of 1.
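The fix is the standard ceiling-division idiom, shown here as a sketch (constant name taken from the message above; the real code is C++/CUDA):

```python
NUM_THREADS_PER_WARPGROUP = 128

def ceil_div(a: int, b: int) -> int:
    # Round-up division: (a + b - 1) // b. For 32 consumers and a
    # 128-thread warpgroup this yields 1 instead of the truncated 0
    # that crashed barrier initialization.
    return (a + b - 1) // b

print(32 // NUM_THREADS_PER_WARPGROUP)          # → 0 (the bug)
print(ceil_div(32, NUM_THREADS_PER_WARPGROUP))  # → 1 (the fix)
```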
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration
* drop 12.4
* fix correct name
* cibuildwheel.yml
* squashed
* fixes
* fixes
* Fix narrow
* Add TORCH_STABLE_ONLY flag
* new_empty + zero_ --> new_zeros
* revert flash_api.cpp and add flash_api_stable.cpp
* update setup.py
* Only pass TORCH_STABLE_ONLY for stable build
* Address Jane's comments
* > to >=
Improve flash.cute paged_kv cpasync
…ILab#2174)
* update row_max before safe overwrite
* move up row_max_prev
…ab#2104)
* fully shard paged KV address calculation across threads
* use t0 indices for static bound checking
* increase tiled copy to full KV row
* shrink predicate tensor
* clarify paged KV divisibility constraints
* increase load register allocation
Add mask_r2p_dual_bound function using XOR of two bitmasks to efficiently mask elements outside [col_limit_left, col_limit_right) range for SM100 local attention.
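The XOR-of-two-bitmasks trick can be sketched as follows: the XOR of two prefix masks leaves exactly the bits for columns in [col_limit_left, col_limit_right) set. This is an illustrative Python sketch of the bit manipulation only, not the SM100 kernel code:

```python
def mask_in_range(col_limit_left: int, col_limit_right: int, width: int = 32) -> int:
    def bits_below(n: int) -> int:
        # Prefix mask: bits 0..n-1 set, clamped to the register width.
        n = max(0, min(n, width))
        return (1 << n) - 1
    # XOR cancels the shared prefix, keeping only columns in
    # [col_limit_left, col_limit_right).
    return bits_below(col_limit_right) ^ bits_below(col_limit_left)

print(bin(mask_in_range(2, 5)))  # → 0b11100 (columns 2, 3, 4)
```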
[Cute,Fwd,Sm100] Add r2p for local mask
…ab#2194)
* fix
* same fix for bwd and SM80
Conflict resolution strategy:
- hopper/ kernel files: Keep downstream (n_offset, CP, varlen combine)
- hopper/ API files: Take upstream bindings
- csrc/flash_attn_ck/: Take upstream (AMD priority)
- Deleted hopper/flash_api_torch_lib.cpp (upstream has torch bindings)
- Preserved prepare_seqlen_q_ptr instead of num_m_blocks_ptr
- Added attention_chunk field to flash.h (API compat, unused in kernels)
Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.
- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)
Using downstream's hopper code (with n_offset, CP, varlen combine) for full compatibility. Upstream changes are kept in non-hopper files.
No description provided.