[TLX] Interleave TMA stores across MMA groups in Blackwell GEMM epilogue #1003

Open
htyu wants to merge 1 commit into facebookexperimental:main from htyu:export-D94608909

@htyu (Contributor) commented Feb 27, 2026

Summary:
Add an interleaved epilogue mode to the Blackwell warp-specialized GEMM
kernel. When `INTERLEAVE_EPILOGUE=1`, the epilogue alternates TMA stores
between MMA group 0 and group 1 instead of draining each group
sequentially. This overlaps the TMA store latency of one group with the
TMEM read of the other, improving memory throughput on store-bound shapes.
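The difference between the two store orders can be sketched in plain Python. This is only an illustration of the ordering, not the kernel code: the function name `epilogue_store_order` and the idea of splitting each group's accumulator into `num_slices` store slices are assumptions made for the example.

```python
from typing import List, Tuple


def epilogue_store_order(
    num_slices: int, interleave: bool, num_groups: int = 2
) -> List[Tuple[int, int]]:
    """Return the (mma_group, slice) order in which output slices are stored.

    Hypothetical helper: `num_slices` is an assumed number of store slices
    per MMA group, used here only to make the ordering visible.
    """
    if interleave:
        # Alternate groups: while group 0's store is in flight,
        # the next step works on group 1, and vice versa.
        return [(g, s) for s in range(num_slices) for g in range(num_groups)]
    # Sequential: drain every slice of group 0 before touching group 1.
    return [(g, s) for g in range(num_groups) for s in range(num_slices)]


sequential = epilogue_store_order(num_slices=2, interleave=False)
interleaved = epilogue_store_order(num_slices=2, interleave=True)
print("sequential :", sequential)
print("interleaved:", interleaved)
```

In the interleaved order, consecutive steps always belong to different groups, which is what gives the store of one group a chance to overlap with the TMEM read of the other.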

The interleaved path is enabled by default for GPU-saturated shapes and
for tall-M shapes with small K (low arithmetic intensity). It is disabled
for Split-K configs (which use atomic reductions) and for the tall-M
high-arithmetic-intensity path with `BLOCK_K=128`.
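A rough sketch of how these defaults could be expressed, assuming the heuristic has already classified the shape into one of the paths named above. The `HeuristicPath` enum and `default_interleave_epilogue` function are hypothetical names for illustration; the actual classification thresholds are not stated in this description and are deliberately left out.

```python
from enum import Enum, auto


class HeuristicPath(Enum):
    """Hypothetical labels for the shape classes named in the description."""
    GPU_SATURATED = auto()
    TALL_M_SMALL_K = auto()   # low arithmetic intensity
    TALL_M_HIGH_AI = auto()   # high arithmetic intensity, BLOCK_K=128


def default_interleave_epilogue(path: HeuristicPath, split_k: int) -> int:
    """Return the default INTERLEAVE_EPILOGUE value (0 or 1) for a config."""
    if split_k > 1:
        # Split-K configs use atomic reductions, so interleaving is disabled.
        return 0
    if path in (HeuristicPath.GPU_SATURATED, HeuristicPath.TALL_M_SMALL_K):
        return 1
    # Tall-M high-arithmetic-intensity path keeps the sequential epilogue.
    return 0


for path in HeuristicPath:
    print(path.name, default_interleave_epilogue(path, split_k=1))
```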

Autotuning is also extended to explore `INTERLEAVE_EPILOGUE` in {0, 1},
with config pruning updated to filter invalid combinations (interleave
requires `NUM_MMA_GROUPS == 2` and `SPLIT_K == 1`).
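The pruning rule can be sketched as a simple filter over candidate configs. This is a minimal illustration using plain dictionaries rather than the kernel's real autotuner configs; `is_valid_config`, `prune_configs`, and the candidate value ranges are assumptions chosen for the example.

```python
from itertools import product
from typing import Dict, List

Config = Dict[str, int]


def is_valid_config(cfg: Config) -> bool:
    """Interleaving needs exactly two MMA groups and no Split-K."""
    if cfg["INTERLEAVE_EPILOGUE"] == 1:
        return cfg["NUM_MMA_GROUPS"] == 2 and cfg["SPLIT_K"] == 1
    # Non-interleaved configs are not restricted by this rule.
    return True


def prune_configs(configs: List[Config]) -> List[Config]:
    """Drop configs that combine interleaving with unsupported settings."""
    return [cfg for cfg in configs if is_valid_config(cfg)]


# Illustrative search space; the value ranges here are assumptions.
candidates = [
    {"NUM_MMA_GROUPS": groups, "SPLIT_K": split_k, "INTERLEAVE_EPILOGUE": il}
    for groups, split_k, il in product((1, 2), (1, 2), (0, 1))
]
valid = prune_configs(candidates)
print(f"{len(valid)} of {len(candidates)} candidate configs remain")
```

Of the four interleaved candidates in this toy search space, only the one with `NUM_MMA_GROUPS == 2` and `SPLIT_K == 1` survives pruning.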

Perf on B200 (TFLOPS):

```
                         aten      tlx_matmul_ws
           (M, N, K)  matmul   before    after   delta
  (8192, 8192, 8192) 1142.09  1168.46  1182.49  +1.2%
 (3159809, 384, 384)  647.23   664.21   644.05  -3.0%
(1152, 12800, 32768) 1124.60  1076.00  1069.98  -0.6%
  (1024, 256, 16384)  363.73   209.39   209.72  +0.2%
  (560849, 512, 896)  889.47   898.91   938.96  +4.5%
 (589824, 512, 2048)  915.12   959.46   959.39  -0.0%
 (1152, 65536, 1024) 1071.84   926.53   962.52  +3.9%
  (8192, 4608, 6144) 1170.41  1176.01  1195.01  +1.6%
(16384, 11264, 5632) 1089.06  1132.88  1149.37  +1.5%
  (8192, 8192, 2048) 1193.88  1141.82  1162.53  +1.8%
             average  960.74   935.37   947.40  +1.3%
```
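For readers checking the numbers, the `delta` column is consistent with `(after / before - 1) * 100`. The short script below recomputes it from the `before` and `after` columns; the helper name `delta_pct` is made up for this example. The `average` row's delta appears to be taken from the averaged columns (a ratio of averages) rather than as the mean of the per-shape deltas.

```python
# (M, N, K) -> (before, after) TFLOPS for tlx_matmul_ws, copied from the table.
results = {
    (8192, 8192, 8192): (1168.46, 1182.49),
    (3159809, 384, 384): (664.21, 644.05),
    (1152, 12800, 32768): (1076.00, 1069.98),
    (1024, 256, 16384): (209.39, 209.72),
    (560849, 512, 896): (898.91, 938.96),
    (589824, 512, 2048): (959.46, 959.39),
    (1152, 65536, 1024): (926.53, 962.52),
    (8192, 4608, 6144): (1176.01, 1195.01),
    (16384, 11264, 5632): (1132.88, 1149.37),
    (8192, 8192, 2048): (1141.82, 1162.53),
}


def delta_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after / before - 1.0) * 100.0


for shape, (before, after) in results.items():
    print(f"{shape}: {delta_pct(before, after):+.1f}%")

# Average row: average each column first, then take the relative change.
avg_before = sum(b for b, _ in results.values()) / len(results)
avg_after = sum(a for _, a in results.values()) / len(results)
print(f"average: {avg_before:.2f} -> {avg_after:.2f} "
      f"({delta_pct(avg_before, avg_after):+.1f}%)")
```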

Differential Revision: D94608909
@meta-cla meta-cla bot added the CLA Signed label Feb 27, 2026
@meta-codesync meta-codesync bot commented Feb 27, 2026

@htyu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94608909.

@htyu htyu requested a review from dshi7 February 27, 2026 21:21
htyu added a commit to htyu/triton-1 that referenced this pull request Feb 27, 2026
…gue (facebookexperimental#1003)

htyu added a commit to htyu/triton-1 that referenced this pull request Feb 28, 2026
…gue (facebookexperimental#1003)

Labels: CLA Signed, fb-exported, meta-exported
