perf: optimize memory allocations in solve system hot paths #73
Closed
Description
Summary
Performance profiling of the solve system identified several hot paths with unnecessary memory allocations that could be optimized.
Optimization Opportunities
1. src/solve/forward_cache.jl - Per-Stencil Allocations (Lines 54-64)
local_data = [data[i] for i in neighbors] # Allocates vector each iteration
A_full = zeros(TD, n, n) # Allocates fresh matrix
b = zeros(TD, n, num_ops) # Allocates fresh RHS
Fix: Pre-allocate buffers outside the evaluation loop and reuse them with fill!.
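A minimal sketch of the buffer-reuse pattern, assuming all stencils have the same size. `assemble_all!` and the placeholder assembly inside it are hypothetical stand-ins, not the code in forward_cache.jl; only the allocate-once-then-`fill!` structure is the point.

```julia
using LinearAlgebra

# Hypothetical stand-in for the per-stencil assembly loop (illustrative names only).
function assemble_all!(out, data, stencils, num_ops)
    n = length(first(stencils))        # assumption: every stencil has the same size
    TD = eltype(data)
    local_data = Vector{TD}(undef, n)  # allocated once, reused for every stencil
    A_full = zeros(TD, n, n)
    b = zeros(TD, n, num_ops)
    for (s, neighbors) in enumerate(stencils)
        # Gather in place instead of building a new vector with a comprehension
        for (j, i) in enumerate(neighbors)
            local_data[j] = data[i]
        end
        # Reset the reused buffers instead of calling zeros(...) again
        fill!(A_full, zero(TD))
        fill!(b, zero(TD))
        # Placeholder assembly: fill the upper triangle and the first RHS column
        for j in 1:n, i in 1:j
            A_full[i, j] = abs(local_data[i] - local_data[j])
        end
        b[:, 1] .= local_data
        out[s] = sum(A_full) + sum(b)  # placeholder for the real solve
    end
    return out
end

data = collect(1.0:10.0)
stencils = [[1, 2, 3], [4, 5, 6], [8, 9, 10]]
out = assemble_all!(zeros(length(stencils)), data, stencils, 2)
```

If stencil sizes vary, the buffers would need to be sized for the largest stencil and indexed through views, which this sketch does not cover.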
2. src/solve/execution.jl - Kernel Matrix Allocation (Lines 235-238)
@kernel function weight_kernel(...)
for eval_idx in start_idx:end_idx
n = k + nmon
A = Symmetric(zeros(TD, n, n), :U) # Allocates per eval point!
b = _prepare_buffer(ℒrbf, TD, n)
Fix: Hoist allocations outside the loop; reuse buffers across evaluation points.
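One way the hoisting could look, sketched as a plain CPU function rather than the actual `@kernel`. `batch_weights!`, `points`, and the tridiagonal placeholder system are assumptions for illustration. The `Symmetric` wrapper is created once and shares memory with its backing matrix, so refilling the backing matrix updates the wrapper.

```julia
using LinearAlgebra

# Hypothetical CPU stand-in for one batch of the weight loop (not the package's kernel).
function batch_weights!(weights, points, start_idx, end_idx)
    n = size(weights, 1)
    TD = eltype(weights)
    A_data = zeros(TD, n, n)   # hoisted: one backing matrix per batch
    A = Symmetric(A_data, :U)  # wrapper shares memory with A_data, no copy
    b = zeros(TD, n)           # hoisted right-hand side
    for eval_idx in start_idx:end_idx
        fill!(A_data, zero(TD))
        fill!(b, zero(TD))
        # Placeholder system: only the upper triangle is written
        for j in 1:n
            A_data[j, j] = TD(2) + points[eval_idx]
            j < n && (A_data[j, j + 1] = -one(TD))
            b[j] = one(TD)
        end
        # Note: the solve itself still allocates; only the buffers are reused here
        weights[:, eval_idx] .= A \ b
    end
    return weights
end

n = 5
points = [0.1, 0.2, 0.3]
weights = batch_weights!(zeros(n, length(points)), points, 1, length(points))
```

Each batch owns its own buffers in this sketch, so batches can still run on separate tasks without sharing mutable state.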
3. src/solve/forward_cache.jl - Dense Matrix Copies (Lines 81-87)
# Explicitly filling lower triangle
for j in 1:n
for i in (j + 1):n
A_full_symmetric[i, j] = A_full[j, i] # Redundant O(n²) copy
end
end
stencil_caches[eval_idx] = StencilForwardCache(copy(λ), A_full_symmetric, k, nmon)
Fix: Wrap the matrix as Symmetric(A_full, :U) instead of copying the upper triangle into the lower one. Consider whether full matrix storage is necessary at all.
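A small check that the wrapper reads the same values as the explicitly mirrored matrix, using a made-up 4×4 example. Whether `StencilForwardCache` can store a `Symmetric` instead of a dense `Matrix` is an assumption that depends on how the cached matrix is used downstream.

```julia
using LinearAlgebra

n = 4
A_full = zeros(n, n)
# Fill only the upper triangle (including the diagonal)
for j in 1:n, i in 1:j
    A_full[i, j] = i + 10j
end

# Explicit mirror, as in the current code: an O(n²) copy into a dense matrix
A_copy = copy(A_full)
for j in 1:n, i in (j + 1):n
    A_copy[i, j] = A_full[j, i]
end

# Wrapper: reads of the lower triangle are redirected to the upper triangle
A_sym = Symmetric(A_full, :U)
same_values = A_sym == A_copy
shares_memory = parent(A_sym) === A_full
```

Because the wrapper shares memory with `A_full`, the backing matrix must not be overwritten while the cache still refers to it.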
4. src/interpolation.jl - Scalar Loop Instead of BLAS (Lines 36-50)
for i in eachindex(rbfi.rbf_weights)
rbf += rbfi.rbf_weights[i] * rbfi.rbf_basis(x, rbfi.x[i]) # Scalar accumulation
end
Fix: Pre-compute basis evaluations into a vector, then use dot(rbfi.rbf_weights, basis_vals) for BLAS acceleration.
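A sketch comparing the scalar accumulation with the pre-computed `dot` form. The `gaussian` basis, `eval_scalar`, and `eval_dot!` are hypothetical stand-ins for `rbfi.rbf_basis` and the interpolator's evaluation; the `basis_vals` buffer is assumed to be allocated once and reused across evaluation points.

```julia
using LinearAlgebra

# Hypothetical basis function standing in for rbfi.rbf_basis
gaussian(x, c) = exp(-sum(abs2(a - b) for (a, b) in zip(x, c)))

# Current form: scalar accumulation
function eval_scalar(weights, centers, x)
    acc = zero(eltype(weights))
    for i in eachindex(weights)
        acc += weights[i] * gaussian(x, centers[i])
    end
    return acc
end

# Proposed form: fill a reusable buffer, then reduce with dot
function eval_dot!(basis_vals, weights, centers, x)
    for i in eachindex(weights)
        basis_vals[i] = gaussian(x, centers[i])
    end
    return dot(weights, basis_vals)
end

centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = [0.5, -1.0, 2.0]
x = [0.25, 0.75]
basis_vals = similar(weights)  # reusable buffer for basis evaluations

r_scalar = eval_scalar(weights, centers, x)
r_dot = eval_dot!(basis_vals, weights, centers, x)
```

The two results may differ in the last few bits because `dot` can sum in a different order. How much this helps likely depends on how expensive the basis evaluation is relative to the reduction.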
Expected Impact
- Reduced GC pressure during weight computation
- Better cache locality from buffer reuse
- Potential 2-5x speedup for large stencil sizes (k > 30)
Related
This follows the BLAS optimizations added in commit 4c64025 for the backward pass.