[TLX] Interleave TMA stores across MMA groups in Blackwell GEMM epilogue #1003
Open
htyu wants to merge 1 commit into facebookexperimental:main
Summary:
Add an interleaved epilogue mode to the Blackwell warp-specialized GEMM
kernel. When `INTERLEAVE_EPILOGUE=1`, the epilogue alternates TMA stores
between MMA group 0 and group 1 instead of draining each group
sequentially. This overlaps the TMA store latency of one group with the
TMEM read of the other, improving memory throughput on store-bound shapes.
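The change in store ordering can be pictured with a small host-side Python model. This is only a sketch of the issue order, not the kernel code: the function names, the `num_slices` parameter (a stand-in for the number of epilogue subtiles per MMA group), and the `(group, slice)` tuples are all assumptions made for illustration.

```python
def sequential_store_order(num_groups, num_slices):
    """Drain every slice of one MMA group before touching the next group."""
    return [(g, s) for g in range(num_groups) for s in range(num_slices)]


def interleaved_store_order(num_groups, num_slices):
    """Alternate between MMA groups, issuing one slice from each in turn."""
    return [(g, s) for s in range(num_slices) for g in range(num_groups)]


if __name__ == "__main__":
    # Each tuple is (mma_group, slice_index) in the order stores are issued.
    print(sequential_store_order(2, 2))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(interleaved_store_order(2, 2))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

In the interleaved order, the store for a group 0 slice is followed immediately by work on group 1, which is where the overlap between one group's store latency and the other group's TMEM read would come from.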
The interleaved path is enabled by default for GPU-saturated shapes and
for tall-M shapes with small K (low arithmetic intensity). It is disabled
for Split-K configs (which use atomic reductions) and for the tall-M
high-arithmetic-intensity path with BLOCK_K=128.
Autotuning is also extended to explore `INTERLEAVE_EPILOGUE` in {0, 1},
with config pruning updated to filter invalid combinations (interleave
requires `NUM_MMA_GROUPS == 2` and `SPLIT_K == 1`).
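The pruning rule can be sketched in plain Python as follows. This is a hedged illustration rather than the code in this PR: configs are modeled as plain dicts instead of Triton config objects, and `is_valid_config`, `prune_configs`, and the toy search space in `candidate_configs` are hypothetical names and values. Only the constraint itself (interleave requires `NUM_MMA_GROUPS == 2` and `SPLIT_K == 1`) comes from the description above.

```python
from itertools import product


def is_valid_config(cfg):
    """Interleaving is only valid with two MMA groups and no Split-K."""
    if cfg["INTERLEAVE_EPILOGUE"] == 1:
        return cfg["NUM_MMA_GROUPS"] == 2 and cfg["SPLIT_K"] == 1
    # Non-interleaved configs are not restricted by this rule.
    return True


def prune_configs(configs):
    """Keep only the configs that satisfy the interleave constraint."""
    return [cfg for cfg in configs if is_valid_config(cfg)]


def candidate_configs():
    """Toy search space (assumed values, for illustration only)."""
    keys = ("NUM_MMA_GROUPS", "SPLIT_K", "INTERLEAVE_EPILOGUE")
    return [dict(zip(keys, vals)) for vals in product((1, 2), (1, 2), (0, 1))]


if __name__ == "__main__":
    for cfg in prune_configs(candidate_configs()):
        print(cfg)
```

With this toy space, the only interleaved config that survives pruning is the one with `NUM_MMA_GROUPS=2` and `SPLIT_K=1`.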
Performance on B200 (TFLOPS; delta compares after against before):
```
                           aten    tlx_matmul_ws
(M, N, K)                matmul     before      after   delta
(8192, 8192, 8192)      1142.09    1168.46    1182.49   +1.2%
(3159809, 384, 384)      647.23     664.21     644.05   -3.0%
(1152, 12800, 32768)    1124.60    1076.00    1069.98   -0.6%
(1024, 256, 16384)       363.73     209.39     209.72   +0.2%
(560849, 512, 896)       889.47     898.91     938.96   +4.5%
(589824, 512, 2048)      915.12     959.46     959.39   -0.0%
(1152, 65536, 1024)     1071.84     926.53     962.52   +3.9%
(8192, 4608, 6144)      1170.41    1176.01    1195.01   +1.6%
(16384, 11264, 5632)    1089.06    1132.88    1149.37   +1.5%
(8192, 8192, 2048)      1193.88    1141.82    1162.53   +1.8%
average                  960.74     935.37     947.40   +1.3%
```
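The delta column is consistent with a relative change of the "after" figure against the "before" figure. A minimal sketch of that arithmetic, assuming this is how the column was computed (the helper name `delta_percent` is hypothetical):

```python
def delta_percent(before, after):
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # (8192, 8192, 8192) row from the table above.
    print(f"{delta_percent(1168.46, 1182.49):+.1f}%")
    # (3159809, 384, 384) row, the one regression in the table.
    print(f"{delta_percent(664.21, 644.05):+.1f}%")
```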
Differential Revision: D94608909
htyu added a commit to htyu/triton-1 that referenced this pull request on Feb 27, 2026:
…gue (facebookexperimental#1003)
htyu added a commit to htyu/triton-1 that referenced this pull request on Feb 28, 2026:
…gue (facebookexperimental#1003)