[TLX] Interleave TMA stores across MMA groups in Blackwell GEMM epilogue by htyu · Pull Request #1003 · facebookexperimental/triton

htyu · 2026-02-27T21:13:51Z

Summary:
Add an interleaved epilogue mode to the Blackwell warp-specialized GEMM
kernel. When INTERLEAVE_EPILOGUE=1, the epilogue alternates TMA stores
between MMA group 0 and group 1 instead of draining each group
sequentially. This overlaps the TMA store latency of one group with the
TMEM read of the other, improving memory throughput on store-bound shapes.
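
The difference between the two store orders can be sketched in plain Python. This is an illustration only, not the kernel code: the function name `epilogue_store_order` and the idea of slicing each group's accumulator into `num_subtiles` pieces are assumptions, since the description does not say at what granularity the stores are issued.

```python
from typing import List, Tuple


def epilogue_store_order(
    num_subtiles: int, interleave: bool, num_groups: int = 2
) -> List[Tuple[int, int]]:
    """Return the (mma_group, subtile) order in which stores would be issued.

    Hypothetical model: each MMA group's accumulator is split into
    `num_subtiles` pieces, and each piece needs one TMEM read followed
    by one TMA store.
    """
    if interleave:
        # Alternate between groups: g0/s0, g1/s0, g0/s1, g1/s1, ...
        return [(g, s) for s in range(num_subtiles) for g in range(num_groups)]
    # Drain each group completely before starting the next one.
    return [(g, s) for g in range(num_groups) for s in range(num_subtiles)]


print(epilogue_store_order(2, interleave=False))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(epilogue_store_order(2, interleave=True))   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

In the interleaved order, consecutive operations touch different groups, which is what lets the store of one group be in flight while the other group's TMEM read proceeds.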

The interleaved path is enabled by default for GPU-saturated shapes and
for tall-M shapes with small K (low arithmetic intensity). It is disabled
for Split-K configs (which use atomic reductions) and for the tall-M
high-arithmetic-intensity path with BLOCK_K=128.

Autotuning is also extended to explore INTERLEAVE_EPILOGUE in {0, 1},
with config pruning updated to filter invalid combinations (interleave
requires NUM_MMA_GROUPS == 2 and SPLIT_K == 1).
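
The pruning rule stated above might look roughly like the following. This is a minimal sketch assuming configs are plain dictionaries; the helper names `is_valid_config` and `prune_configs` and the candidate values are hypothetical, and the actual pruning hook in the kernel is not shown in this description.

```python
from itertools import product


def is_valid_config(cfg: dict) -> bool:
    """Interleaving requires exactly two MMA groups and no Split-K."""
    if cfg["INTERLEAVE_EPILOGUE"] == 1:
        return cfg["NUM_MMA_GROUPS"] == 2 and cfg["SPLIT_K"] == 1
    # Non-interleaved configs are not constrained by this rule.
    return True


def prune_configs(configs: list) -> list:
    """Drop configs that combine interleaving with an unsupported setup."""
    return [cfg for cfg in configs if is_valid_config(cfg)]


# Illustrative search space (values are assumptions, not the real autotune grid).
candidates = [
    {"NUM_MMA_GROUPS": g, "SPLIT_K": k, "INTERLEAVE_EPILOGUE": i}
    for g, k, i in product((1, 2), (1, 2, 4), (0, 1))
]
kept = prune_configs(candidates)
print(len(candidates), len(kept))  # 12 7
```

Of the six candidates with `INTERLEAVE_EPILOGUE=1`, only the one with `NUM_MMA_GROUPS=2` and `SPLIT_K=1` survives; all six non-interleaved candidates are kept.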

Perf on B200 (TFLOPS):

                         aten      tlx_matmul_ws
           (M, N, K)  matmul   before    after   delta
  (8192, 8192, 8192) 1142.09  1168.46  1182.49  +1.2%
 (3159809, 384, 384)  647.23   664.21   644.05  -3.0%
(1152, 12800, 32768) 1124.60  1076.00  1069.98  -0.6%
  (1024, 256, 16384)  363.73   209.39   209.72  +0.2%
  (560849, 512, 896)  889.47   898.91   938.96  +4.5%
 (589824, 512, 2048)  915.12   959.46   959.39  -0.0%
 (1152, 65536, 1024) 1071.84   926.53   962.52  +3.9%
  (8192, 4608, 6144) 1170.41  1176.01  1195.01  +1.6%
(16384, 11264, 5632) 1089.06  1132.88  1149.37  +1.5%
  (8192, 8192, 2048) 1193.88  1141.82  1162.53  +1.8%
             average  960.74   935.37   947.40  +1.3%
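
For readers checking the table, the delta column appears to be the relative change from "before" to "after". A short sketch, assuming that definition (the helper name `pct_delta` is hypothetical), reproduces two of the rows:

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the table above.
print(round(pct_delta(898.91, 938.96), 1))  # 4.5  -> (560849, 512, 896)
print(round(pct_delta(664.21, 644.05), 1))  # -3.0 -> (3159809, 384, 384)
```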

Differential Revision: D94608909

meta-codesync · 2026-02-27T21:13:58Z

@htyu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94608909.

meta-cla bot added the CLA Signed label Feb 27, 2026

meta-codesync bot added the fb-exported and meta-exported labels Feb 27, 2026

htyu requested a review from dshi7 February 27, 2026 21:21

htyu wants to merge 1 commit into facebookexperimental:main from htyu:export-D94608909
