
[GPU] Implement CDNA block intrinsics (4/6) #23977

Merged
nirvedhmeshram merged 1 commit into iree-org:main from nirvedhmeshram:block_intrinsics_task4
Apr 1, 2026

Conversation

@nirvedhmeshram
Contributor

@nirvedhmeshram nirvedhmeshram commented Mar 31, 2026

Intrinsics with a single-element accumulator (e.g. 4x4 f64 with 4 blocks) require the accumulator to be extracted to a scalar before it is passed to amdgpu.mfma, and the scalar result to be broadcast back to the vector type. Without this step, the result type is not valid under the op definition.
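The extract-then-broadcast wrapping described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the IREE lowering or real MLIR: `call_with_single_element_acc` and `fake_mfma` are hypothetical names, a Python list stands in for the one-element vector, and the arithmetic in `fake_mfma` is arbitrary.

```python
from typing import Callable, List


def call_with_single_element_acc(
    acc: List[float],
    scalar_op: Callable[[float], float],
) -> List[float]:
    """Wrap an op that only accepts a scalar accumulator (hypothetical helper).

    The one-element "vector" is unpacked to a scalar, the op runs on the
    scalar, and the scalar result is broadcast back to the vector shape.
    """
    if len(acc) != 1:
        raise ValueError("expected a single-element accumulator")
    scalar_acc = acc[0]                    # extract the only element
    scalar_result = scalar_op(scalar_acc)  # the op sees and returns a scalar
    return [scalar_result] * len(acc)      # broadcast back to vector shape


def fake_mfma(acc: float) -> float:
    # Hypothetical stand-in for the intrinsic: accumulate one fixed product.
    return acc + 2.0 * 3.0


result = call_with_single_element_acc([1.0], fake_mfma)
```

Accumulators with more than one element would skip this wrapping and be passed as vectors; the sketch rejects them only to keep the single-element case explicit.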

Part of #23941

Block intrinsics with a single-element accumulator (e.g. 4x4 with 16
blocks) require the acc to be extracted to a scalar before passing to
amdgpu.mfma, and the scalar result broadcast back to the vector type.

Part of iree-org#23941

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nirvedhmeshram nirvedhmeshram merged commit 37dce38 into iree-org:main Apr 1, 2026
63 checks passed
@nirvedhmeshram nirvedhmeshram deleted the block_intrinsics_task4 branch April 3, 2026 17:40