perf: optimize memory allocations in solve system hot paths

## Summary

Profiling the solve system identified four hot paths that allocate memory unnecessarily. Each is listed below with a proposed fix.

## Optimization Opportunities

### 1. `src/solve/forward_cache.jl` - Per-Stencil Allocations (Lines 54-64)

```julia
local_data = [data[i] for i in neighbors]  # Allocates vector each iteration
A_full = zeros(TD, n, n)                   # Allocates fresh matrix
b = zeros(TD, n, num_ops)                  # Allocates fresh RHS
```

**Fix**: Pre-allocate buffers outside the evaluation loop and reuse with `fill!`.
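A minimal sketch of what this fix could look like, assuming every stencil has the same size `n`. The function `stencil_sums!`, the toy matrix and RHS entries, and the `adjl` neighbor list are hypothetical stand-ins for the real assembly code, not the package's API:

```julia
# Hypothetical stand-in for the per-stencil assembly loop.
# Buffers are allocated once and reset with fill! on every iteration.
function stencil_sums!(out, data, adjl, num_ops)
    TD = eltype(out)
    n = length(first(adjl))                      # assumes uniform stencil size
    local_data = Vector{eltype(data)}(undef, n)  # reused gather buffer
    A_full = zeros(TD, n, n)                     # reused matrix buffer
    b = zeros(TD, n, num_ops)                    # reused RHS buffer
    for (eval_idx, neighbors) in enumerate(adjl)
        # Gather in place instead of building a new vector with a comprehension.
        for (j, i) in enumerate(neighbors)
            local_data[j] = data[i]
        end
        fill!(A_full, zero(TD))
        fill!(b, zero(TD))
        # Toy entries: upper triangle of A and one RHS column per operator.
        for c in 1:n, r in 1:c
            A_full[r, c] = abs(local_data[r] - local_data[c])
        end
        for op in 1:num_ops, r in 1:n
            b[r, op] = op * local_data[r]
        end
        out[eval_idx] = sum(A_full) + sum(b)
    end
    return out
end

data = [0.0, 1.0, 3.0, 6.0]
adjl = [[1, 2, 3], [2, 3, 4]]
out = stencil_sums!(zeros(2), data, adjl, 2)
```

The `fill!` calls matter whenever the assembly writes only part of a buffer (here only the upper triangle), since stale values from the previous stencil would otherwise leak through.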

### 2. `src/solve/execution.jl` - Kernel Matrix Allocation (Lines 235-238)

```julia
@kernel function weight_kernel(...)
    for eval_idx in start_idx:end_idx
        n = k + nmon
        A = Symmetric(zeros(TD, n, n), :U)  # Allocates per eval point!
        b = _prepare_buffer(ℒrbf, TD, n)
        # ... (rest of loop body elided)
    end
end
```

**Fix**: Hoist allocations outside the loop; reuse buffers across evaluation points.
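One way to structure the hoisting, sketched under the assumption that each batch processes its own `start_idx:end_idx` range and can therefore own a private workspace. `StencilWorkspace`, `solve_batch!`, and the toy matrix entries are hypothetical and only illustrate the pattern:

```julia
using LinearAlgebra

# Hypothetical per-batch workspace: allocated once, reused for every eval point.
struct StencilWorkspace{T}
    A::Matrix{T}
    b::Vector{T}
end
StencilWorkspace{T}(n::Int) where {T} = StencilWorkspace{T}(zeros(T, n, n), zeros(T, n))

function solve_batch!(weights, points, start_idx, end_idx, n)
    T = eltype(weights)
    ws = StencilWorkspace{T}(n)   # one allocation per batch, not per eval point
    A = Symmetric(ws.A, :U)       # wrapper shares storage with ws.A
    for eval_idx in start_idx:end_idx
        x = points[eval_idx]
        fill!(ws.A, zero(T))
        # Write through the parent buffer; only the upper triangle is needed.
        for j in 1:n, i in 1:j
            ws.A[i, j] = i == j ? T(2) + x : T(1) / (1 + abs(i - j))
        end
        fill!(ws.b, one(T))
        # Note: the solve itself still allocates its factorization and result.
        weights[:, eval_idx] .= A \ ws.b
    end
    return weights
end

pts = [0.0, 0.5, 1.0, 2.0]
W = solve_batch!(zeros(3, 4), pts, 1, 4, 3)
```

Keeping the workspace local to the batch avoids sharing mutable buffers between batches that may run concurrently. Off-diagonal entries are written into `ws.A` rather than into the `Symmetric` wrapper, because `setindex!` on a `Symmetric` matrix is only supported for diagonal elements.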

### 3. `src/solve/forward_cache.jl` - Dense Matrix Copies (Lines 81-87)

```julia
# Explicitly filling lower triangle
for j in 1:n
    for i in (j + 1):n
        A_full_symmetric[i, j] = A_full[j, i]  # Redundant O(n²) copy
    end
end
stencil_caches[eval_idx] = StencilForwardCache(copy(λ), A_full_symmetric, k, nmon)
```

**Fix**: Wrap the matrix as `Symmetric(A_full, :U)` instead of copying the upper triangle into the lower one. Consider whether full matrix storage is necessary at all.
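The small example below shows why the explicit lower-triangle fill is redundant: `Symmetric` from `LinearAlgebra` wraps the existing array and reads mirrored entries from the chosen triangle. The matrix contents here are made up for illustration:

```julia
using LinearAlgebra

n = 4
A_full = zeros(n, n)
for j in 1:n, i in 1:j
    A_full[i, j] = 10i + j      # populate the upper triangle only
end

A_sym = Symmetric(A_full, :U)   # no copy: A_sym wraps A_full

# Lower-triangle reads are served from the upper triangle of the parent.
lower_entry = A_sym[3, 1]
upper_entry = A_full[1, 3]
```

If the cache field that stores the matrix is currently typed as a plain `Matrix`, it would need to accept a `Symmetric` wrapper (or an abstract matrix type) for this to apply; that detail of `StencilForwardCache` is an assumption, not something confirmed here.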

### 4. `src/interpolation.jl` - Scalar Loop Instead of BLAS (Lines 36-50)

```julia
for i in eachindex(rbfi.rbf_weights)
    rbf += rbfi.rbf_weights[i] * rbfi.rbf_basis(x, rbfi.x[i])  # Scalar accumulation
end
```

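**Fix**: Evaluate the basis functions into a preallocated vector, then use `dot(rbfi.rbf_weights, basis_vals)` for BLAS acceleration. The vector must be reused across calls; otherwise this change adds an allocation per evaluation.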
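A rough sketch of this change, using a hypothetical Gaussian basis and a caller-supplied buffer. `eval_rbf!` and `gaussian` are illustrative names; the argument names only loosely mirror the `rbfi` fields in the snippet above:

```julia
using LinearAlgebra

# Hypothetical basis function standing in for rbfi.rbf_basis.
gaussian(x, c) = exp(-sum(abs2, x .- c))

# Fill a caller-owned buffer with basis values, then reduce with dot.
function eval_rbf!(basis_vals, weights, centers, basis, x)
    for i in eachindex(weights, centers, basis_vals)
        basis_vals[i] = basis(x, centers[i])
    end
    return dot(weights, basis_vals)
end

centers = [0.0, 1.0, 2.0]
weights = [1.0, -2.0, 0.5]
buf = similar(weights)          # allocated once, reused for every x
val = eval_rbf!(buf, weights, centers, gaussian, 0.5)
```

Whether this beats the scalar loop depends on how expensive the basis evaluation is relative to the reduction, so it is worth benchmarking before and after.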

## Expected Impact

- Reduced GC pressure during weight computation
- Better cache locality from buffer reuse
- Potential 2-5x speedup for large stencil sizes (k > 30)

## Related

This follows the BLAS optimizations added in commit 4c64025 for the backward pass.
