Add space-based CUDA kernel fusion for copyto #2482

Draft
petebachant wants to merge 12 commits into main from pb/fieldname-set

@petebachant
Member

This gives a 13% increase in SYPD (simulated years per day) for the prog EDMF 1M Atmos config (no land).

Kernel analysis:

[Image: kernel analysis]

@petebachant petebachant moved this to In review in Performance Apr 6, 2026
@petebachant petebachant moved this from In review to In progress in Performance Apr 6, 2026
@petebachant petebachant marked this pull request as draft April 6, 2026 17:02
@petebachant
Member Author

Interestingly, this shows no improvement when running the prog EDMF 1M config in the coupler: https://buildkite.com/clima/climacore-end-to-end-performance/builds/142/steps/canvas?sid=019d6395-829b-4900-93c5-cea1aca53baf&tab=output

@petebachant
Member Author

@dennisYatunin this may be interesting to you. When running ClimaAtmos on its own, this change speeds things up significantly, but when running it through the coupler with the exact same Atmos config, we hit the fallback condition, i.e., the compiler fails to fuse the kernels.
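
For readers unfamiliar with the terms, here is a CPU-only sketch of what fusing several `copyto!` calls with a fallback can mean. `fused_copyto!` is a hypothetical helper written for illustration; it is not the implementation in this PR or any ClimaCore API. It uses matching array axes as a stand-in for "same space", and its fallback is triggered by mismatched axes rather than by a compiler failure as in the PR.

```julia
using Base.Broadcast: broadcasted, instantiate, materialize!

# Hypothetical helper (not the ClimaCore API): evaluate several lazy
# broadcasts in a single loop when every destination and broadcast has the
# same axes; otherwise fall back to one separate pass per destination.
function fused_copyto!(pairs::Pair...)
    dests = map(first, pairs)
    bcs = map(p -> instantiate(last(p)), pairs)
    ax = axes(dests[1])
    if all(d -> axes(d) == ax, dests) && all(bc -> axes(bc) == ax, bcs)
        # Fused path: one traversal of the index space writes every output.
        for I in CartesianIndices(ax)
            foreach((d, bc) -> (d[I] = bc[I]), dests, bcs)
        end
        return :fused
    else
        # Fallback path: one traversal per destination.
        foreach((d, bc) -> materialize!(d, bc), dests, bcs)
        return :fallback
    end
end

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
a = similar(x)
b = similar(x)
c = zeros(2)

# Same axes everywhere, so both outputs are filled in one loop.
status_fused = fused_copyto!(a => broadcasted(+, x, y), b => broadcasted(*, x, y))
a_fused, b_fused = copy(a), copy(b)

# Destinations with different axes take the unfused fallback.
status_fallback = fused_copyto!(a => broadcasted(-, y, x), c => broadcasted(*, 2.0, [1.0, 2.0]))
```

On a GPU, the fused path would correspond to a single kernel launch and the fallback to one launch per destination, which is where the launch-overhead savings would come from.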

@dennisYatunin
Member

dennisYatunin commented Apr 10, 2026

Wow, yeah, that is very interesting! Running the same function from ClimaCoupler should only add a couple of stack frames on top of running it in ClimaAtmos, but I suppose this shows that your example is right on the edge of triggering a compiler heuristic and de-optimizing.

I think the simplest way to avoid this would be to directly call the ClimaAtmos implicit solver from the AMIP driver, forcing it to be compiled efficiently before it gets called inside ClimaCoupler's deeper stacktrace. That's definitely not a pattern we want to use frequently, and understanding how to avoid these compiler heuristics altogether would be a much more sustainable solution. But at least while we're still figuring things out you can use it, especially if that leads to such a big performance improvement.
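
A rough sketch of that workaround pattern, under stated assumptions: `implicit_solve!` and `coupler_step!` are hypothetical stand-ins, not ClimaAtmos or ClimaCoupler functions. The only point is that the driver calls the solver once at a shallow stack depth, with the same argument types, before the deeper call path runs. Whether this sidesteps the heuristic in the real code is not verified here.

```julia
# Hypothetical stand-in for the ClimaAtmos implicit solver (not the real API).
function implicit_solve!(Y, p, dt)
    @. Y = Y + dt * p
    return Y
end

# Hypothetical stand-in for the deeper ClimaCoupler call path.
coupler_step!(Y, p, dt) = implicit_solve!(Y, p, dt)

function run_driver()
    Y = zeros(4)
    p = ones(4)
    dt = 0.5

    # Warm-up: call the solver directly from the driver on scratch copies,
    # so it is compiled for these argument types at a shallow stack depth
    # without modifying the real state.
    implicit_solve!(copy(Y), copy(p), dt)

    # Normal path through the coupler; the intent is that this reuses the
    # specialization compiled by the warm-up call.
    for _ in 1:2
        coupler_step!(Y, p, dt)
    end
    return Y
end

result = run_driver()
```

The warm-up operates on copies so the simulation state is untouched; the cost is one extra solver call at startup.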

@petebachant petebachant self-assigned this Apr 21, 2026

Projects

Status: In progress

2 participants