Conversation
a47f544 to
31782db
Compare
| // loop has one shared reuse accumCnt, not one per inner loop. | ||
| SmallVector<Operation *> dummy; | ||
| // Track seen ops for the reuse group section. | ||
| DenseSet<Operation *> seenOps; |
There was a problem hiding this comment.
The fix tries to separately handle accumCnts for perRegion and per reuse group case, and thus we can remove seenOps.
| // Use rowSize for the N-dimension extent, since colOffset is an | ||
| // N-dimension position. | ||
| Interval candSizeRange = {colOffset, colOffset + cand->rowSize}; | ||
| Interval allocSizeRange = {alloc->colOffset, |
There was a problem hiding this comment.
colOffset is an N-dimension (TMEM row) position, while colSize is an M-dimension extent. Using rowSize (N-dimension extent) to advance the position resolves the "can't find TMEM space" issue
There was a problem hiding this comment.
I am a little confused about the colSize vs. rowSize. Was it working for GEMM and non-causal fwd just by chance? :]
There was a problem hiding this comment.
sorry, this is related to one of the backport PR #975:

I realized it is better to swap the argument during instantiation, so I reverted rowSize change and changed ttng::TMemAllocation(allocSize.numRows, allocSize.numCols)}); in the new commit.
| if (!alloc->isOwnerOfSpace && alloc->reuseOwner == reuseOwner) { | ||
| if (sameLoop(alloc) || | ||
| bufferRange[alloc].intersects(bufferRange[cand])) | ||
| if (bufferRange[alloc].intersects(bufferRange[cand])) { |
There was a problem hiding this comment.
relax the sameLoop(alloc) here to support two sequential loops in causal FA.
There was a problem hiding this comment.
Need to refresh my memory on causal now :] For causal, do we call allocateTMem twice? Once on the first innermost loop, then on the 2nd innermost loop. So when running on the 2nd innermost loop, checking against allocs for the 1st inner loop, sameLoop should return false?
There was a problem hiding this comment.
I see. I reverted this change in the new commits. Here it seems a simplification rather than necessary part of the fix -- for cross-loop allocs, sameLoop is false so only intersects is evaluated; for same-loop allocs, seems in our use case intersects returns true for local allocs that are live through the loop body, so removing sameLoop did not break anything.
| idTypes[bufferIdInnermostSplit] = elemType; | ||
| ++bufferId; | ||
| } | ||
| assignedId = bufferIdInnermostSplit; |
There was a problem hiding this comment.
Enable sharing for split buffers to avoid SMEM out of space error.
There was a problem hiding this comment.
The current logic in CodePartitioning is that if buffer.copy is 1, reuse group will use separate barriers, if it is > 1, reuse group will share the same array of barriers. CC @njriasan Did you add this logic for gemm? I wonder if we have a pytest of gemm to make sure this will not regress your use case. We can check in a shell script to run all tests locally for now.
This PR tries to make the tritonbench kernel blackwell_triton_fused_attention_dp.py causal mode work with autoWS. It also needs the backport fix PR #959 and #989.
Test plan
In tritonbench, run
Result using devgpu-31.atn1:
Tested the change does not impact the non-causal version:
In tritonbench, run
Result using devgpu-31.atn1: