[GPU] Implement CDNA block intrinsics (2/6) by nirvedhmeshram · Pull Request #23963 · iree-org/iree

nirvedhmeshram · 2026-03-30T19:23:32Z

Update GPUPackToIntrinsics and convertContractionToInnerTiledMma to
handle the batch dimension introduced by CDNA3 block intrinsics.

Depends on #23949
Please review top commit only.

Add block intrinsic enum definitions and corresponding layout structures. These layouts have been e2e verified in a test PR, iree-org#23934. They will not be added to the target details until we have correct heuristic support and a selection policy for them. The plan for this is described in iree-org#23941.

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update GPUPackToIntrinsics and convertContractionToInnerTiledMma to handle the batch dimension introduced by CDNA3 block intrinsics.

Depends on iree-org#23949
Please review top commit only.

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

krzysz00

One note - these didn't get introduced in CDNA3, last I checked. They've been around for quite a while, they just got renamed in CDNA3. Unless there's a particular set of intrinsics I don't know about.

Overall, I think these changes are good, and I don't really have any style complaints here.

nirvedhmeshram · 2026-03-30T20:01:15Z

@krzysz00 Makes sense. I saw that the names for these first appeared in CDNA3, so I misunderstood. I don't see any block intrinsics for CDNA1, but CDNA2 has some with names like v_mfma_f32_32x32x4bf16_1k. Do you think it's OK to map them back to the renamed CDNA3 intrinsic and add it to CDNA2 under the new name rather than the old one?

krzysz00 · 2026-03-30T21:44:11Z

@nirvedhmeshram So there are two things that happened:

1. In CDNA3, a lot of instructions were renamed to explicitly mention how many blocks they take.
2. CDNA3 did genuinely add some wider intrinsics, but I don't know if the block ones got widened at all. I know, for example, that the 4x4x4 (16 blocks) intrinsic has been around since CDNA1. I'd check AMDGPUToROCDL for the exact architecture boundaries on the various {block, m, n, k} combinations.

nirvedhmeshram · 2026-03-30T21:50:19Z

@krzysz00 I don't see block intrinsics in the CDNA1 ISA here

krzysz00 · 2026-03-30T21:57:46Z

@nirvedhmeshram Oh, they're there! They're just hiding.

Consider, for example, 32x32x1_f32, as converted to ROCDL in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L876 or the 4x4x4_i8 intrinsic in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L949

Notice how if you compute the number of input registers needed for the LHS or RHS across all threads (32 for the former case, (4 * 4 / 4 (bytes / reg)) == 4 for the latter) you get a number less than 64 - that is, there are lanes that don't have inputs to provide. Now, what are those lanes doing ... in MFMA, they're always providing additional blocks!

So that i32_4x4x4_i8 intrinsic in your picture is an intrinsic with 16 blocks - it's just not named like that
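The register-counting argument above can be written out as a small sketch. It is only an illustration of the arithmetic in this comment, not code from IREE or LLVM: the function names are hypothetical, and it assumes a 64-lane wave, 4-byte registers, and (by default) one input register per lane for the operand, which matches the two examples discussed here.

```python
# Illustrative arithmetic only; names and defaults are assumptions.
WAVE_SIZE = 64          # lanes per wave, per the "less than 64" remark above
BYTES_PER_REGISTER = 4  # one 32-bit register


def operand_registers(rows: int, k: int, element_bytes: int) -> int:
    """Registers needed across all lanes to hold one rows x k operand tile."""
    total_bytes = rows * k * element_bytes
    if total_bytes % BYTES_PER_REGISTER != 0:
        raise ValueError("operand tile does not fill whole registers")
    return total_bytes // BYTES_PER_REGISTER


def implied_blocks(rows: int, k: int, element_bytes: int,
                   registers_per_lane: int = 1) -> int:
    """Number of blocks implied by lanes that one tile leaves without inputs.

    registers_per_lane is a hypothetical knob for operands that occupy more
    than one register per lane; the examples below use the default of 1.
    """
    needed = operand_registers(rows, k, element_bytes)
    available = WAVE_SIZE * registers_per_lane
    if available % needed != 0:
        raise ValueError("tile registers do not evenly divide the wave")
    return available // needed


# 32x32x1 f32: the LHS tile is 32 x 1 elements of 4 bytes each.
print(implied_blocks(rows=32, k=1, element_bytes=4))
# 4x4x4 i8: the LHS tile is 4 x 4 elements of 1 byte each.
print(implied_blocks(rows=4, k=4, element_bytes=1))
```

Under these assumptions the i8 case comes out to 16, matching the 16 blocks mentioned above, and the f32 case comes out to 2.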

nirvedhmeshram and others added 2 commits March 27, 2026 19:21

nirvedhmeshram requested review from Groverkss, Max191, krzysz00, kuhar and qedawkins as code owners March 30, 2026 19:23

krzysz00 approved these changes Mar 30, 2026

nirvedhmeshram mentioned this pull request Mar 30, 2026

[GPU] Implement block intrinsics introduced in CDNA3 (1/6) #23949

Merged

nirvedhmeshram changed the title ~~[GPU] Implement block intrinsics introduced in CDNA3 (2/6)~~ [GPU] Implement CDNA block intrinsics (2/6) Mar 31, 2026

nirvedhmeshram merged commit 14acfac into iree-org:main Mar 31, 2026
63 checks passed

nirvedhmeshram mentioned this pull request Mar 31, 2026

[GPU] Add CDNA block intrinsics #23941

Open

6 tasks

nirvedhmeshram deleted the block_intrinsics_task2 branch March 31, 2026 15:16

nirvedhmeshram merged 2 commits into iree-org:main from nirvedhmeshram:block_intrinsics_task2
