[GPU] Implement CDNA block intrinsics (2/6)#23963

Merged
nirvedhmeshram merged 2 commits into iree-org:main from nirvedhmeshram:block_intrinsics_task2
Mar 31, 2026

Conversation

@nirvedhmeshram
Contributor

Update GPUPackToIntrinsics and convertContractionToInnerTiledMma to
handle the batch dimension introduced by CDNA3 block intrinsics.

Depends on #23949
Please review top commit only.

nirvedhmeshram and others added 2 commits March 27, 2026 19:21
Add block intrinsic enum definitions and corresponding layout structures.
These layouts have been e2e verified in a test PR
iree-org#23934.
They will not be added to the target details until we have correct
heuristic support and a selection policy for them. The plan for this is
described in iree-org#23941

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update GPUPackToIntrinsics and convertContractionToInnerTiledMma to
handle the batch dimension introduced by CDNA3 block intrinsics.

Depends on iree-org#23949
Please review top commit only.
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@krzysz00 krzysz00 left a comment

One note - these didn't get introduced in CDNA3, last I checked. They've been around for quite a while, they just got renamed in CDNA3. Unless there's a particular set of intrinsics I don't know about.

Overall, I think these changes are good, and I don't really have any style complaints here.

@nirvedhmeshram
Contributor Author

@krzysz00 Makes sense. I saw that the names for these first appeared in CDNA3, so I misunderstood. I don't see any block intrinsics for CDNA1, but CDNA2 has names like v_mfma_f32_32x32x4bf16_1k. Do you think it's OK to map them to the renamed CDNA3 intrinsic and add it to CDNA2 under the new name rather than the old one?

@krzysz00
Contributor

@nirvedhmeshram So there are two things that happened

  1. A lot of instructions were renamed to explicitly mention how many blocks they took in CDNA3
  2. CDNA3 did genuinely add some wider intrinsics, but I don't know if the block ones got widened at all. I know, for example, that the 4x4x4 (16 blocks) intrinsic has been around since CDNA1. I'd check AMDGPUToROCDL for the exact architecture boundaries on the various {block, m, n, k} combinations.

@nirvedhmeshram
Contributor Author

@krzysz00 I don't see block intrinsics in the CDNA1 ISA here:

image

@krzysz00
Contributor

@nirvedhmeshram Oh, they're there! They're just hiding.

Consider, for example, 32x32x1_f32, as converted to ROCDL in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L876 or the 4x4x4_i8 intrinsic in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L949

Notice how if you compute the number of input registers needed for the LHS or RHS across all threads (32 for the former case, (4 * 4 / 4 (bytes / reg)) == 4 for the latter) you get a number less than 64 - that is, there are lanes that don't have inputs to provide. Now, what are those lanes doing ... in MFMA, they're always providing additional blocks!

So that i32_4x4x4_i8 intrinsic in your picture is an intrinsic with 16 blocks - it's just not named like that
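
The lane-counting argument above can be sketched as a few lines of arithmetic. This is only an illustration of the two cases mentioned in the comment, not code from IREE or LLVM: the helper names `operand_lanes` and `implied_blocks` are hypothetical, and it assumes a 64-lane wavefront where each lane supplies exactly one 32-bit register for the operand. Intrinsics where a lane supplies more than one register per operand would need a different calculation.

```python
WAVE_SIZE = 64      # lanes per wavefront (assumption: wave64 MFMA)
REGISTER_BYTES = 4  # one 32-bit register per lane (assumption for these two cases)


def operand_lanes(rows: int, k: int, elem_bytes: int) -> int:
    """Lanes needed to hold one rows x k operand tile, one 32-bit register each."""
    total_bytes = rows * k * elem_bytes
    assert total_bytes % REGISTER_BYTES == 0, "tile must fill whole registers"
    return total_bytes // REGISTER_BYTES


def implied_blocks(rows: int, k: int, elem_bytes: int) -> int:
    """Blocks implied when the remaining lanes each feed an additional block."""
    lanes = operand_lanes(rows, k, elem_bytes)
    assert lanes <= WAVE_SIZE and WAVE_SIZE % lanes == 0, "outside this sketch's scope"
    return WAVE_SIZE // lanes


# 32x32x1 f32: 32 registers for the LHS, so 64 / 32 = 2 blocks.
print(implied_blocks(rows=32, k=1, elem_bytes=4))
# 4x4x4 i8: 4 * 4 bytes / 4 bytes per register = 4 registers, so 64 / 4 = 16 blocks.
print(implied_blocks(rows=4, k=4, elem_bytes=1))
```

Under these assumptions the first call gives 2 and the second gives 16, which matches the 16-block reading of the 4x4x4 i8 intrinsic described above.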

@nirvedhmeshram nirvedhmeshram changed the title [GPU] Implement block intrinsics introduced in CDNA3 (2/6) [GPU] Implement CDNA block intrinsics (2/6) Mar 31, 2026
@nirvedhmeshram nirvedhmeshram merged commit 14acfac into iree-org:main Mar 31, 2026
63 checks passed
@nirvedhmeshram nirvedhmeshram deleted the block_intrinsics_task2 branch March 31, 2026 15:16