[GPU] Implement CDNA block intrinsics (2/6) #23963
nirvedhmeshram merged 2 commits into iree-org:main
Add block intrinsic enum definitions and corresponding layout structures. These layouts have been verified end to end in a test PR, iree-org#23934. They will not be added to the target details until we have correct heuristic support and a selection policy for them. The plan for this is described in iree-org#23941.
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update GPUPackToIntrinsics and convertContractionToInnerTiledMma to handle the batch dimension introduced by CDNA3 block intrinsics. Depends on iree-org#23949. Please review top commit only.
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
krzysz00
left a comment
One note - these didn't get introduced in CDNA3, last I checked. They've been around for quite a while, they just got renamed in CDNA3. Unless there's a particular set of intrinsics I don't know about.
Overall, I think these changes are good, and I don't really have any style complaints here.
@krzysz00 Makes sense. I saw that the names for these first appeared in CDNA3, so I misunderstood. I don't see any block intrinsics for CDNA1, but CDNA2 has some names that go like this [image]
@nirvedhmeshram So there are two things that happened
@nirvedhmeshram Oh, they're there! They're just hiding. Consider, for example, 32x32x1_f32, as converted to ROCDL in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L876 or the 4x4x4_i8 intrinsic in https://github.com/llvm/llvm-project/blob/74c42432252de25caa51deaa38458d734622a579/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L949

Notice that if you compute the number of input registers needed for the LHS or RHS across all threads (32 for the former case, 4 * 4 / 4 (bytes / reg) == 4 for the latter), you get a number less than 64 - that is, there are lanes that don't have inputs to provide. Now, what are those lanes doing? In MFMA, they're always providing additional blocks! So that i32_4x4x4_i8 intrinsic in your picture is an intrinsic with 16 blocks - it's just not named like that.
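
The register-counting argument in this comment can be written out as a small calculation. The sketch below is only illustrative and is not code from IREE or LLVM: the helper name `implied_blocks` and both constants are hypothetical, and it assumes a 64-lane wavefront, 32-bit registers, and exactly one operand register supplied per lane.

```python
WAVE_SIZE = 64          # assumed lanes per wavefront
BYTES_PER_REGISTER = 4  # assumed size of one 32-bit register


def implied_blocks(rows: int, k: int, element_bytes: int) -> int:
    """Blocks implied by one operand tile of shape rows x k.

    Hypothetical helper: counts the registers one block's operand needs
    across all threads, then divides the wavefront's lanes by that count.
    Only valid when each lane supplies a single register for the operand.
    """
    tile_bytes = rows * k * element_bytes
    if tile_bytes % BYTES_PER_REGISTER != 0:
        raise ValueError("operand tile does not fill whole registers")
    registers = tile_bytes // BYTES_PER_REGISTER
    if registers > WAVE_SIZE or WAVE_SIZE % registers != 0:
        raise ValueError("tile does not fit the one-register-per-lane assumption")
    # Lanes beyond the first `registers` supply operands for further blocks.
    return WAVE_SIZE // registers


# 32x32x1_f32: 32 * 1 four-byte elements -> 32 registers
print(implied_blocks(rows=32, k=1, element_bytes=4))
# 4x4x4_i8: 4 * 4 one-byte elements -> 16 bytes -> 4 registers
print(implied_blocks(rows=4, k=4, element_bytes=1))
```

Under these assumptions the 4x4x4_i8 case comes out to 16 blocks, matching the count stated in the comment, and the 32x32x1_f32 case comes out to 2.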