Conversation
```julia
arg1_isa_matrix =
    eltype(arg1) <: BandMatrixRow || (arg1 isa LazyOperatorBroadcasted)
if arg1 isa LazyOperatorBroadcasted && length(arg1.args) > 0
    arg1_isa_matrix =
        eltype(arg1.args[1]) <: BandMatrixRow ||
        arg1.args[1] isa LazyOperatorBroadcasted
end
```
I've seen this sort of conditional variable updating hurt inference inside recursive calls. Better to have something like

```julia
arg1_isa_matrix =
    arg1 isa LazyOperatorBroadcasted && length(arg1.args) > 0 ?
    eltype(arg1.args[1]) <: BandMatrixRow || arg1.args[1] isa LazyOperatorBroadcasted :
    eltype(arg1) <: BandMatrixRow || arg1 isa LazyOperatorBroadcasted
```

But also, this looks like it should be defined recursively? Only going down one level into arg1.args feels a bit arbitrary.
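A recursive version might look like the following (a sketch only, not tested; `is_matrix_arg` is a hypothetical helper name, and it assumes the same `BandMatrixRow` and `LazyOperatorBroadcasted` types as in the snippet above):

```julia
# Hypothetical recursive helper: walk down the first argument of nested
# LazyOperatorBroadcasted objects until a non-broadcasted leaf is reached,
# then check whether that leaf's eltype is a banded matrix row.
is_matrix_arg(arg) =
    arg isa LazyOperatorBroadcasted && length(arg.args) > 0 ?
    is_matrix_arg(arg.args[1]) :
    eltype(arg) <: BandMatrixRow || arg isa LazyOperatorBroadcasted
```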
```julia
# TODO: move into CUDAExt
```
This could also live in Geometry/rmul_with_projection.jl for now
I think that would make a dependency loop because MatrixFields depends on Geometry already
```julia
    else
        return Operators.return_eltype(matrix1.op.op, matrix1.args[1], arg)
    end
end
```
Does the call to rmul_return_type below not generate the same result? I don't see why this new branch is needed.
Not in the case of divgrad of a vector.
```julia
@inline @inbounds project_row2_for_mul(mat1_row, mat2_row, mat2_space)
# It should be possible to use static shared memory here, but it allocates new shared memory
# for each layer of recursion
CUDA.sync_threads()
```
What is the purpose of this first synchronization? The second one is to ensure that every level sees the same values in mat2, but there are no matrix values being synchronized here.
To ensure that any potential shared-memory (shmem) use from the previous level of recursion is complete.
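For context, the double-synchronization pattern being discussed is roughly the following (a hedged sketch of the idea only, not the actual kernel; `recursive_step!` and the shmem layout are made up for illustration):

```julia
using CUDA

# Hypothetical sketch: each recursion level reuses the same shared-memory
# buffer, so threads must synchronize before overwriting it (so no thread is
# still reading the previous level's values) and again after writing it (so
# every thread sees the new values).
function recursive_step!(shmem, depth)
    CUDA.sync_threads()          # first sync: previous level's shmem reads are done
    shmem[threadIdx().x] = depth # overwrite the shared buffer for this level
    CUDA.sync_threads()          # second sync: new values are visible to all threads
    depth == Int32(0) && return nothing
    recursive_step!(shmem, depth - Int32(1))
end
```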
```julia
project_onto =
    ClimaCore.Geometry.recursively_find_dual_axes_for_projection(typeof(mat1_row))
if space.staggering isa Spaces.CellCenter && v == Int32(64)
    lg = rzero(Spaces.local_geometry_type(typeof(space)))
```
Suggested change:

```julia
lg = new_struct(Spaces.local_geometry_type(typeof(space)))
```

You can avoid all these calls to rzero (which come with a decently large latency penalty) by using something like
```julia
@generated new_struct(::Type{T}) where {T} = Expr(:new, :T)
```

```julia
# row_mul_vec! handles banded matrix * vector. There are four methods, but they all have
# the same structure, so they could be written as a single method.
# The others can be obtained by copy-pasting and changing the indices appropriately.
# Note that these are all specialized for 64 faces, so the indices are hardcoded.
Base.@propagate_inbounds function row_mul_vec!(
    ::Type{P},
    mat1_row,
    matrix2,
    ::FaceToCenter,
) where {P}
    @inbounds begin
        prod_eltype = P
        v = threadIdx().x
        i = threadIdx().y
        mat1_eltype = typeof(mat1_row)
        mat2_eltype = eltype(matrix2)
        ld1, ud1 = MatrixFields.outer_diagonals(mat1_eltype)
        li = Int32(1)
        ri = Int32(63)
        zero_entry = rzero(prod_eltype)
        return UnrolledUtilities.unrolled_mapreduce(
            ⊞,
            ld1:ud1;
            init = zero_entry,
        ) do mat1_row_d
            if (Int32(0) < v + mat1_row_d + half <= Int32(64))
                @inbounds outer_or_mul(
                    mat1_row[mat1_row_d],
                    matrix2[v + mat1_row_d + half + (i - Int32(1)) * Int32(64)],
                )
            else
                zero_entry
            end
        end
    end
end
```
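To illustrate the rzero-avoidance suggestion above in isolation (a minimal standalone sketch; `Vec2` is a made-up isbits type, not something from ClimaCore):

```julia
# Create an instance of a struct without running any constructor or zeroing
# fields; for an isbits type the result is a valid (if uninitialized) object.
@generated new_struct(::Type{T}) where {T} = Expr(:new, :T)

struct Vec2
    x::Float64
    y::Float64
end

v = new_struct(Vec2)
println(v isa Vec2)  # true
```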
Suggested change (replace the specialized `FaceToCenter` method above with a single general method):

```julia
@inline function row_mul_vec!(::Type{P}, mat_row, vec, shape) where {P}
    v_mat = threadIdx().x
    i = threadIdx().y
    zero_entry = rzero(P)
    ld, ud = MatrixFields.outer_diagonals(typeof(mat_row))
    d_offset = shape == FaceToCenter() ? half : shape == CenterToFace() ? -half : 0
    return UnrolledUtilities.unrolled_mapreduce(⊞, ld:ud; init = zero_entry) do d
        v_vec = v_mat + d + d_offset
        Int32(1) <= v_vec <= Spaces.nlevels(axes(vec)) || return zero_entry
        @inbounds outer_or_mul(mat_row[d], vec[v_vec + (i - Int32(1)) * Int32(64)])
    end
end
```
I think this method covers all 4 cases of matrix-vector multiplication. And it should be straightforward to extend to matrix-matrix multiplication, letting you get rid of all the code duplication in this file.
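As a hedged sketch of that matrix-matrix extension (untested; the name `row_mul_mat!` and the diagonal bookkeeping are hypothetical, and boundary handling is only approximated):

```julia
# Hypothetical sketch: banded matrix * banded matrix with the same
# unrolled-mapreduce pattern. Each output diagonal d of mat1 * matrix2 is a
# sum over pairs of input diagonals (d1, d2) with d1 + d2 == d.
@inline function row_mul_mat!(::Type{P}, mat1_row, matrix2, shape) where {P}
    v1 = threadIdx().x
    zero_entry = rzero(P)
    ld1, ud1 = MatrixFields.outer_diagonals(typeof(mat1_row))
    ld2, ud2 = MatrixFields.outer_diagonals(eltype(matrix2))
    d_offset = shape == FaceToCenter() ? half : shape == CenterToFace() ? -half : 0
    return UnrolledUtilities.unrolled_map((ld1 + ld2):(ud1 + ud2)) do d
        UnrolledUtilities.unrolled_mapreduce(⊞, ld1:ud1; init = zero_entry) do d1
            d2 = d - d1               # second-matrix diagonal forced by d and d1
            v2 = v1 + d1 + d_offset   # row of matrix2 hit by this diagonal
            in_bounds = ld2 <= d2 <= ud2 &&
                Int32(1) <= v2 <= Spaces.nlevels(axes(matrix2))
            in_bounds || return zero_entry
            @inbounds outer_or_mul(mat1_row[d1], matrix2[v2][d2])
        end
    end
end
```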
Force-pushed from 17c2e6c to 743c371
Add gpu support to column_convergence.jl and unit_column.jl
Force-pushed from 6ce8a50 to be7c8c4
Force-pushed from 0f340b4 to 4f7f19b
rename new_entry cleanup test cleanup enable test frmt renaming Add back inbounds
Force-pushed from 5c0316c to 1fa3401
Force-pushed from e3dbde6 to ee17c08
TODO before merge:
- Generic Nv support
- Squash commits
- Delete old copyto_stencil_64
- Rename files and kernel
- Add shmem support to auto_launch
- Code follows the style guidelines OR N/A.
- Unit tests are included OR N/A.
- Code is exercised in an integration test OR N/A.
- Documentation has been added/updated OR N/A.