
FD Operator CUDAExt #2466

Merged
imreddyTeja merged 8 commits into main from tr/matmul on Mar 31, 2026

Conversation

@imreddyTeja
Member

@imreddyTeja imreddyTeja commented Mar 10, 2026

TODO before merge:

  • Generic Nv support

  • Squash commits

  • Delete old copyto_stencil_64

  • rename files and kernel

  • add shmem support to auto_launch

  • Code follows the style guidelines OR N/A.

  • Unit tests are included OR N/A.

  • Code is exercised in an integration test OR N/A.

  • Documentation has been added/updated OR N/A.

Comment thread test/Operators/finitedifference/convergence_column.jl Outdated
Comment thread src/MatrixFields/MatrixFields.jl Outdated
Comment on lines +127 to +133
arg1_isa_matrix =
    eltype(arg1) <: BandMatrixRow || arg1 isa LazyOperatorBroadcasted
if arg1 isa LazyOperatorBroadcasted && length(arg1.args) > 0
    arg1_isa_matrix =
        eltype(arg1.args[1]) <: BandMatrixRow ||
        arg1.args[1] isa LazyOperatorBroadcasted
end
Member

I've seen this sort of conditional variable updating hurt inference inside recursive calls. Better to have something like

arg1_isa_matrix =
    arg1 isa LazyOperatorBroadcasted && length(arg1.args) > 0 ?
    eltype(arg1.args[1]) <: BandMatrixRow || arg1.args[1] isa LazyOperatorBroadcasted :
    eltype(arg1) <: BandMatrixRow || arg1 isa LazyOperatorBroadcasted

But also, this looks like it should be defined recursively? Only going down one level into arg1.args feels a bit arbitrary.
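A recursive formulation along those lines might look like the following (a hypothetical sketch, not the PR's code: `arg_isa_matrix` is an invented helper name, and the base case for a `LazyOperatorBroadcasted` with no args mirrors the existing `|| arg isa LazyOperatorBroadcasted` fallback):

```julia
# Hypothetical sketch: recurse through nested LazyOperatorBroadcasted
# arguments instead of only peeking one level into arg.args.
arg_isa_matrix(arg) = eltype(arg) <: BandMatrixRow
arg_isa_matrix(arg::LazyOperatorBroadcasted) =
    isempty(arg.args) ? true : arg_isa_matrix(arg.args[1])
```

Splitting the cases across method dispatch rather than a runtime conditional also sidesteps the inference issue mentioned above.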

Comment thread src/MatrixFields/MatrixFields.jl Outdated
end
end

# TODO: move into CUDAExt
Member

This could also live in Geometry/rmul_with_projection.jl for now

Member Author

I think that would create a dependency loop, because MatrixFields already depends on Geometry

else
return Operators.return_eltype(matrix1.op.op, matrix1.args[1], arg)
end
end
Member
Does the call to rmul_return_type below not generate the same result? I don't see why this new branch is needed.

Member Author
Not in the case with divgrad of a vec

Comment thread .buildkite/pipeline.yml Outdated
Comment thread ext/cuda/operators_fd_eager.jl Outdated
@inline @inbounds project_row2_for_mul(mat1_row, mat2_row, mat2_space)
# It should be possible to use static shared memory here, but it allocates new shared memory
# for each layer of recursion
CUDA.sync_threads()
Member

What is the purpose of this first synchronization? The second one is to ensure that every level sees the same values in mat2, but there are no matrix values being synchronized here.

Member Author

To ensure that any shared-memory use inside the recursive call has completed before this level reuses the buffer
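As an illustration of that pattern (a hedged sketch with invented names — `fill_level!` and `entry_for` are not the PR's actual kernel): the recursive call may have read or written the same dynamically allocated shared-memory buffer, so one barrier is needed before this level overwrites it, and a second after writing so every thread sees the new values.

```julia
function fill_level!(shmem, level)
    level == Int32(0) && return nothing
    fill_level!(shmem, level - Int32(1))      # recursion may touch `shmem`
    CUDA.sync_threads()   # barrier 1: recursion's shmem use is complete
    shmem[threadIdx().x] = entry_for(level)   # `entry_for` is a placeholder
    CUDA.sync_threads()   # barrier 2: all threads see this level's values
    return nothing
end
```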

Comment thread ext/cuda/newmm.jl Outdated
Comment thread ext/cuda/newmm.jl Outdated
Comment thread ext/cuda/newmm.jl Outdated
project_onto =
    ClimaCore.Geometry.recursively_find_dual_axes_for_projection(typeof(mat1_row))
if space.staggering isa Spaces.CellCenter && v == Int32(64)
    lg = rzero(Spaces.local_geometry_type(typeof(space)))
Member
Suggested change
- lg = rzero(Spaces.local_geometry_type(typeof(space)))
+ lg = new_struct(Spaces.local_geometry_type(typeof(space)))

You can avoid all these calls to rzero (which come with a decently large latency penalty) by using something like

@generated new_struct(::Type{T}) where {T} = Expr(:new, :T)
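For context on why this helps (a sketch under the assumption that the type is an isbits struct; `Point3` is an invented example type): `Expr(:new, :T)` emits a bare struct allocation with uninitialized fields, whereas `rzero` must recursively construct a zero value for every field.

```julia
@generated new_struct(::Type{T}) where {T} = Expr(:new, :T)

struct Point3{FT}
    x::FT
    y::FT
    z::FT
end

# Produces a Point3{Float64} whose fields hold arbitrary (uninitialized)
# bits; fine here because the value is only needed to satisfy the
# local-geometry argument's type before being overwritten.
lg_placeholder = new_struct(Point3{Float64})
```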

Comment thread ext/cuda/matmul.jl
Comment on lines +307 to +339
# row_mul_vec! handles banded matrix * vector. There are four methods, but they all have the
# same structure, so they could be written as a single method.
# The others can be obtained by copy-pasting and changing the indices appropriately.
# Note that these are all specialized for 64 faces, so the indices are hardcoded.
Base.@propagate_inbounds function row_mul_vec!(
    ::Type{P},
    mat1_row,
    matrix2,
    ::FaceToCenter,
) where {P}
    @inbounds begin
        prod_eltype = P
        v = threadIdx().x
        i = threadIdx().y
        mat1_eltype = typeof(mat1_row)
        mat2_eltype = eltype(matrix2)
        ld1, ud1 = MatrixFields.outer_diagonals(mat1_eltype)
        li = Int32(1)
        ri = Int32(63)
        zero_entry = rzero(prod_eltype)
        return UnrolledUtilities.unrolled_mapreduce(
            ⊞,
            ld1:ud1;
            init = zero_entry,
        ) do mat1_row_d
            if (Int32(0) < v + mat1_row_d + half <= Int32(64))
                @inbounds outer_or_mul(
                    mat1_row[mat1_row_d],
                    matrix2[v + mat1_row_d + half + (i - Int32(1)) * Int32(64)],
                )
            else
                zero_entry
            end
        end
    end
end
Member
Suggested change (replacing the four specialized methods above with a single generic one):
@inline function row_mul_vec!(::Type{P}, mat_row, vec, shape) where {P}
    v_mat = threadIdx().x
    i = threadIdx().y
    zero_entry = rzero(P)
    ld, ud = MatrixFields.outer_diagonals(typeof(mat_row))
    d_offset = shape == FaceToCenter() ? half : shape == CenterToFace() ? -half : 0
    return UnrolledUtilities.unrolled_mapreduce(⊞, ld:ud; init = zero_entry) do d
        v_vec = v_mat + d + d_offset
        Int32(1) <= v_vec <= Spaces.nlevels(axes(vec)) || return zero_entry
        @inbounds outer_or_mul(mat_row[d], vec[v_vec + (i - Int32(1)) * Int32(64)])
    end
end

I think this method covers all 4 cases of matrix-vector multiplication. And it should be straightforward to extend to matrix-matrix multiplication, letting you get rid of all the code duplication in this file.

@imreddyTeja imreddyTeja force-pushed the tr/matmul branch 2 times, most recently from 17c2e6c to 743c371 on March 24, 2026 at 22:18
Add gpu support to column_convergence.jl
and unit_column.jl
@imreddyTeja imreddyTeja force-pushed the tr/matmul branch 2 times, most recently from 6ce8a50 to be7c8c4 on March 25, 2026 at 16:58
@imreddyTeja imreddyTeja marked this pull request as ready for review March 25, 2026 18:15
@imreddyTeja imreddyTeja force-pushed the tr/matmul branch 4 times, most recently from 0f340b4 to 4f7f19b on March 25, 2026 at 23:20
@imreddyTeja imreddyTeja enabled auto-merge (rebase) March 26, 2026 00:00
rename new_entry

cleanup

test cleanup

enable test

frmt

renaming

Add back inbounds
@imreddyTeja imreddyTeja force-pushed the tr/matmul branch 7 times, most recently from 5c0316c to 1fa3401 on March 27, 2026 at 16:30
@imreddyTeja imreddyTeja force-pushed the tr/matmul branch 2 times, most recently from e3dbde6 to ee17c08 on March 30, 2026 at 18:45
@imreddyTeja imreddyTeja disabled auto-merge March 30, 2026 20:20
@imreddyTeja imreddyTeja merged commit 5143dee into main Mar 31, 2026
34 of 36 checks passed
@imreddyTeja imreddyTeja deleted the tr/matmul branch March 31, 2026 22:27