Add space-based CUDA kernel fusion for copyto #2482
petebachant wants to merge 12 commits into main
Interestingly, this shows no improvement when running the prog EDMF 1M config in the coupler: https://buildkite.com/clima/climacore-end-to-end-performance/builds/142/steps/canvas?sid=019d6395-829b-4900-93c5-cea1aca53baf&tab=output
@dennisYatunin this may be interesting to you. When running ClimaAtmos on its own, this change speeds things up significantly, but when run with the coupler with the same exact Atmos config, we hit the fallback condition, i.e., the compiler fails to fuse the kernels.
Wow, yeah, that is very interesting! Running the same function from ClimaCoupler should only add a couple of stack frames on top of running it in ClimaAtmos, but I suppose this shows that your example is right on the edge of triggering a compiler heuristic and de-optimizing. I think the simplest way to avoid this would be to directly call the ClimaAtmos implicit solver from the AMIP driver, forcing it to be compiled efficiently before it gets called from ClimaCoupler's deeper call stack. That's definitely not a pattern we want to use frequently, and understanding how to avoid these compiler heuristics altogether would be a much more sustainable solution. But at least while we're still figuring things out, you can use it, especially if it leads to such a big performance improvement.
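For reference, the warm-up pattern suggested above might look roughly like the sketch below. All of the names here (`solve_step!`, `atmos_step!`, `coupler_step!`) are hypothetical stand-ins, not ClimaAtmos or ClimaCoupler API, and the toy functions are small enough that the compiler will probably inline them anyway, so this only illustrates the call order, not the actual speedup.

```julia
# Hypothetical stand-in for the expensive implicit solve (not ClimaAtmos API).
function solve_step!(out::AbstractVector, x::AbstractVector, dt::Real)
    @. out = x + dt * (1 - x)   # a single fused broadcast
    return out
end

# Hypothetical stand-ins for the extra stack frames added by the coupler.
atmos_step!(out, x, dt) = solve_step!(out, x, dt)
coupler_step!(out, x, dt) = atmos_step!(out, x, dt)

# Warm-up: call the solver directly from the driver, using the same concrete
# argument types as the real run, so that this specialization is compiled
# from a shallow call stack first.
x = fill(0.5f0, 8)
out = similar(x)
solve_step!(out, x, 0.1f0)

# Later, the same method is reached through the deeper call chain.
coupler_step!(out, x, 0.1f0)
```

Whether the warm-up call actually changes what gets compiled for the deeper call depends on inlining and other compiler heuristics, so it would be worth confirming with the kernel analysis that the fused path is hit in the coupled run.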
This is a 13% bump in SYPD (simulated years per day) for the prog EDMF 1M Atmos config (no land).
Kernel analysis: