proto(cubecl): GPU resize + ML preprocessing kernels — 2.4-3.5× faster than NVIDIA VPI on Jetson Orin #897
Draft
Conversation
Design for a new kornia-cubecl crate that prototypes a bilinear u8 RGB 2x downscale kernel and benchmarks cubecl-cuda + cubecl-cpu against the production NEON path (fast_image_resize) on Jetson Orin.
Bilinear u8 RGB 2x downscale kernel using cubecl 0.10-pre.4 (cuda runtime), with weight precompute, public dispatch, correctness test vs fast_image_resize NEON path, and Criterion benchmark with 5 arms across 4 sizes. Lives as a sub-workspace to avoid cubecl-cuda's large dep tree forcing re-resolution of the parent workspace's brittle rerun pinning. cubecl-cpu support is gated behind --features cpu (requires tracel-llvm-20.1.4-7 prebuilt bundle, manually patched on this Jetson due to upstream dir-name mislabel in the v20.1.4-7 release).
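For readers following along, here is a minimal sketch of what the "weight precompute" mentioned above usually means for a bilinear resize. This is illustrative only: the function name and the pixel-center convention are assumptions, not the kernel's actual code.

```rust
/// Illustrative sketch (not the crate's code): precompute, per destination
/// column (and likewise per row), the two source indices and the interpolation
/// weight, so the kernel itself only does lookups and lerps.
fn precompute_axis_weights(src_len: usize, dst_len: usize) -> Vec<(u32, u32, f32)> {
    let scale = src_len as f32 / dst_len as f32;
    (0..dst_len)
        .map(|d| {
            // Assumed pixel-center convention; the real kernel may differ.
            let src = ((d as f32 + 0.5) * scale - 0.5).max(0.0);
            let i0 = src.floor() as u32;
            let i1 = (i0 + 1).min(src_len as u32 - 1);
            let w1 = src - i0 as f32; // weight of the i1 sample; i0 gets 1 - w1
            (i0, i1, w1)
        })
        .collect()
}
```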
…o cpu feature
cubecl-cuda 0.10-pre.4 (via cudarc 0.19) calls cuCoredumpDeregisterCompleteCallback, which requires CUDA 12.3+; Jetson Orin's libcuda.so is older and panics on first allocation. Default to the cpu feature so cargo build works out of the box on this hardware; the cuda feature is still buildable when the target environment supports it. Also fixed the read_one() arg type (takes Handle, not Binding) in test + bench, and swapped the block_on import path (cubecl::future, not cubecl::common::future).
Adds standalone std::time bench (examples/bench_min.rs) that bypasses criterion's heavy release-mode dep tree, plus RESULTS.md with the measurement table and analysis. cubecl-cpu kernel matches fast_image_resize NEON output bit-exactly (max_diff=0) but is 9-119x slower across the 512² → 4096² size sweep on Jetson Orin's CPU. cubecl-cuda arm blocked by libcuda.so missing cuCoredumpDeregisterCompleteCallback (a CUDA 12.3 symbol that cudarc 0.19 expects).
Adds 8192² (4K output) and 1920×1080 (typical ML preprocessing) sizes. Headline change: cubecl-cpu kernel throughput is still ramping at the largest size tested (145 Mpix/s at 4K out, up from 100 at 2K), suggesting asymptotic peak of 200-300 Mpix/s vs NEON's 1100-1650 Mpix/s ceiling. Real compute gap is 5-6×, not 9-119× — small-input numbers were dominated by per-call dispatch overhead.
Adds resize_bilinear_u8_rgb_kernel_x4 and _x16 variants that process 4 or 16 dst pixels per thread, reducing total thread count and amortizing cubecl-cpu's per-thread dispatch overhead. At 8192²→4096² downscale, kernel throughput goes from 154 → 308 Mpix/s (x16 variant), closing the gap to NEON from 8× to 4×. Optimal tile size is not monotonic in input size: x4 wins at 2048², x16 at 4096² and 1080p.
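A scalar model of the x4/x16 tiling idea above (illustrative only; the names below are not the crate's): one launched unit covers TILE consecutive destination pixels instead of one, so the dispatch count drops by TILE× and the per-unit overhead is amortized.

```rust
// Scalar model of the _x4/_x16 variants (illustrative): each "thread" body
// processes TILE consecutive destination pixels, so the runtime launches
// roughly dst_len / TILE units instead of dst_len.
const TILE: usize = 4; // 4 for the _x4 variant, 16 for _x16

fn tiled_unit(dst: &mut [f32], unit_id: usize, produce_pixel: impl Fn(usize) -> f32) {
    let base = unit_id * TILE;
    for i in 0..TILE {
        let x = base + i;
        if x < dst.len() {
            dst[x] = produce_pixel(x); // stands in for the bilinear sample
        }
    }
}
```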
Unblocked cubecl-cuda on Jetson by setting CUDARC_CUDA_VERSION=12060 before build (forces cudarc to bind only CUDA 12.6 symbols; without it, build.rs falls back to cuda-13020 latest which dlsyms cuCoredumpDeregisterCompleteCallback, a CUDA 13.2 symbol absent from Jetson libcuda). Results: cubecl_cuda_kernel hits 2316-2984 Mpix/s vs NEON's 678-1208 across 1024² → 8192² inputs; 1080p→540p ML preprocessing case is 2.9x faster on cuda kernel. End-to-end cuda is dominated by cudaMemcpy though — Tegra unified memory wasted by cudarc's explicit copies. Also: x16 tile variant is faster on CPU (fewer threads = less overhead), slower on GPU (fewer threads = lower occupancy). Same kernel, opposite optima per backend.
Adds resize_bilinear_u8_rgb_with_weights + WeightHandles struct that caches the four small weight-buffer uploads across calls. Material at small sizes where per-dispatch overhead dominates: 256² out goes 500 → 1100 Mpix/s (2.2x).
Head-to-head vs NVIDIA VPI 3.2.4 on the SAME Jetson Orin Nano (not extrapolated from AGX Orin docs); throughputs in Mpix/s:

size          cubecl_cuda_pw   VPI cuda   vs VPI
256² out      1100             42         26.2x
512² out      2646             165        16.0x
1024² out     3098             526        5.9x
1080p→540p    1710             593        2.9x
2048² out     2918             1619       1.8x
4096² out     3418             2566       1.3x

Bench script: crates/kornia-cubecl/examples/vpi_bench.py runs the VPI side. Same inputs, same timing methodology (warmup + 10-rep median). cubecl beats VPI at every size and hits the 10x goal at 256² and 512² output sizes. At the largest sizes both are DRAM-bandwidth-bound and converge.
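A host-side sketch of the caching pattern just described (struct and field names are assumptions; the real WeightHandles presumably holds cubecl device buffer handles rather than Vecs):

```rust
// Sketch of the caching idea (field names are assumptions): the four small
// per-axis tables are computed and uploaded once, then reused by every
// subsequent resize call, removing four tiny uploads from each dispatch.
struct CachedWeights {
    x0: Vec<u32>, // left source column per destination column
    wx: Vec<f32>, // horizontal lerp weight per destination column
    y0: Vec<u32>, // top source row per destination row
    wy: Vec<f32>, // vertical lerp weight per destination row
}

fn resize_frame(dst: &mut [u8], src: &[u8], w: &CachedWeights) {
    // ...bilinear sampling driven by w.x0/w.wx/w.y0/w.wy; in the real crate the
    // tables live on the device and the kernel indexes them directly.
    let _ = (dst, src, w);
}

// Usage: build once, reuse across a stream of frames with the same geometry,
// so the per-dispatch upload cost disappears from the hot path.
// let w = CachedWeights { /* from the per-axis precompute */ };
// for frame in &frames { resize_frame(&mut out, frame, &w); }
```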
Two new variants:
- resize_bilinear_u8_rgb_x4_with_weights: combines 4-pixel-per-thread tiling with pre-uploaded weights. Best for 2048² out (2727 Mpix/s).
- resize_bilinear_u8_rgb_with_weights_wide: 32×8 workgroup instead of 16×16. Best for non-square inputs and large sizes.
Updated VPI head-to-head (best variant per size on Jetson Orin Nano):
256² out:     922 Mpix/s   22.0× VPI
512² out:     2347         14.2× VPI
1024² out:    3154         6.0× VPI
1080p→540p:   2905         4.9× VPI (was 2.9×, now beats by 4.9×)
2048² out:    2727         1.7× VPI
4096² out:    3546         1.4× VPI — 85% of DRAM peak, the ceiling
The 10× target is achieved at 256² and 512² output. Above 1024² out both implementations converge toward the 68 GB/s LPDDR5 bandwidth ceiling and the gap is fundamentally hardware-limited.
Adds a three-tier API so the same source supports both standalone ops and fused pipelines:
Tier 1 — #[cube] primitives (inlined at codegen):
sample_bilinear_u8_rgb_pixel(...)
rgb_to_gray_u8(r, g, b)
normalize_u8_to_f32(g, mean, inv_std)
Tier 2 — standalone launchers (one kernel each):
rgb_to_gray_u8<R>, normalize_u8_to_f32<R>, resize_bilinear_u8_rgb<R>
Tier 3 — pre-fused common pipelines (one kernel, primitives called inline):
resize_to_gray_normalize_with_weights<R>
Bench (bilinear resize → rgb→gray → normalize_to_f32 on Jetson Orin Nano):
size             sequential (Mpix/s)   fused (Mpix/s)   speedup
1024² out 1429 2766 1.94x
1080p → 540p 1114 2890 2.59x
2048² out 1149 3158 2.75x
4096² out 1648 3218 1.95x
4096² out (8K) 1856 3589 1.93x
The 2x matches the theoretical max for a 3-op chain (sequential reads+writes
~96 MB of intermediates per call; fused eliminates them). The 8K fused result
(3589 Mpix/s) slightly beats the standalone resize peak (3546 Mpix/s) because
the f32 gray output is smaller total memory than RGB.
API design: primitives are #[cube] (not #[cube(launch)]) so they inline when
composed. Callers writing a custom pipeline define their own #[cube(launch)]
kernel that calls the primitives in sequence — cubecl emits one CUDA kernel.
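To illustrate what that composition buys, here is a plain-Rust scalar model (not cubecl; the primitive bodies and the normalization convention below are assumptions, and the resize step is omitted). The fused loop is the per-thread work a caller-defined #[cube(launch)] kernel ends up doing once the #[cube] primitives inline, while the sequential version writes and re-reads a full-size intermediate.

```rust
// Scalar model of fusion (illustrative; not the crate's primitives).
#[inline]
fn rgb_to_gray(r: u8, g: u8, b: u8) -> u8 {
    // BT.601-style integer luma; the real primitive's coefficients may differ.
    ((77 * r as u32 + 150 * g as u32 + 29 * b as u32) >> 8) as u8
}

#[inline]
fn normalize(v: u8, mean: f32, inv_std: f32) -> f32 {
    (v as f32 / 255.0 - mean) * inv_std // normalization convention assumed
}

/// Fused: one pass over the source, no intermediate buffers.
fn gray_normalize_fused(rgb: &[u8], out: &mut [f32], mean: f32, inv_std: f32) {
    for (px, o) in rgb.chunks_exact(3).zip(out.iter_mut()) {
        *o = normalize(rgb_to_gray(px[0], px[1], px[2]), mean, inv_std);
    }
}

/// Sequential: same math, but a full-size gray intermediate is written and
/// then re-read — the DRAM round-trip the fused kernels avoid.
fn gray_normalize_sequential(rgb: &[u8], out: &mut [f32], mean: f32, inv_std: f32) {
    let gray: Vec<u8> = rgb
        .chunks_exact(3)
        .map(|p| rgb_to_gray(p[0], p[1], p[2]))
        .collect();
    for (g, o) in gray.iter().zip(out.iter_mut()) {
        *o = normalize(*g, mean, inv_std);
    }
}
```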
…preprocessing
Adds the canonical "image → ML model input" pipeline as both a 2-kernel
sequential and a single fused kernel:
Standalone:
hwc_u8_to_chw_f32_normalize<R> — layout transpose + per-channel normalize (reference layout math sketched after this list)
Composable primitive:
normalize_chan_u8_to_f32(c, mean, inv_std)
Fused launcher:
resize_to_chw_normalize_with_weights<R> — bilinear resize + per-channel
ImageNet-style normalize + CHW layout in one kernel pass
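For reference, a plain-Rust sketch of the layout math behind the HWC→CHW normalize step (the /255 scaling and the mean/inv_std convention are assumptions, and this is a reference model rather than the crate's kernel):

```rust
// Reference sketch of HWC u8 → CHW f32 with per-channel normalize
// (illustrative only; conventions assumed).
fn hwc_u8_to_chw_f32_normalize_ref(
    hwc: &[u8],       // H*W*3, interleaved RGB
    chw: &mut [f32],  // 3*H*W, planar, model-ready
    (h, w): (usize, usize),
    mean: [f32; 3],
    inv_std: [f32; 3],
) {
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                let v = hwc[(y * w + x) * 3 + c] as f32 / 255.0;
                chw[c * h * w + y * w + x] = (v - mean[c]) * inv_std[c];
            }
        }
    }
}
```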
Bench results on Jetson Orin Nano (Pipeline 2: resize → normalize → CHW f32):
size             sequential (Mpix/s)   fused (Mpix/s)   speedup
1024² out 1336 1688 1.26x
1080p → 540p 1124 2243 2.00x ⭐
2048² out 1162 2349 2.02x
4096² out 1571 1978 1.26x
4096² out (8K) 1788 2146 1.20x
Real-world headline: our CHW-fused 1080p→540p ML preprocessing
(resize + ImageNet-style per-channel normalize + CHW transpose) =
231 μs. VPI's *just-the-resize* = 593 μs. Complete cubecl preprocessing
runs 2.6× faster than VPI's resize alone.
CHW fusion speedup is smaller than gray fusion at large sizes because
the f32 CHW output is 3x larger than gray (12 vs 4 bytes/pixel) so
output traffic dominates and saving the intermediate buffer matters
less proportionally.
…of resize
Adds Pipeline 0 (resize-only) and a side-by-side summary at 1080p→540p to the fusion bench. The headline finding:

Pipeline                            median (μs)   vs P0 (resize-only)
P0: resize only                     168.9         1.00× baseline
P1 sequential: resize→gray→norm     500.8         3× SLOWER
P1 fused: same in 1 kernel          173.2         0.98× ← gray+norm is FREE
P2 sequential: resize→CHW+norm      503.2         3× slower
P2 fused: same in 1 kernel          242.8         0.70× ← CHW costs 1.4× resize

The 3× sequential penalty is the DRAM round-trip tax (intermediate buffer write + read between every kernel). Fusion eliminates it. The "P1 fused = 0.98× of resize-only" result is the API pitch in one sentence: adding gray+normalize to a fused kernel literally costs nothing because the bandwidth-bound kernel has idle compute slots, and there are no new memory accesses.

vs NVIDIA VPI (just-the-resize at 1080p = 593 μs):
P1 fused (resize+gray+norm): 3.4× faster while doing 3× the work
P2 fused (resize+CHW+norm): 2.4× faster while producing a model-ready tensor
Summary
Self-contained engineering prototype evaluating cubecl as a portable GPU/CPU compute backend for kornia-rs image kernels. Lives in a new sub-workspace crate
crates/kornia-cubecl/ so the heavy cubecl dep tree (cubecl-cuda + cudarc, cubecl-cpu + MLIR) doesn't infect the main workspace's resolver. This PR is intended as a starting point for GSoC — see the "For @cjpurackal and @Incharajayaram" section below.
Headline numbers (Jetson Orin Nano, head-to-head with NVIDIA VPI 3.2.4)
At 1080p → 540p (typical ML preprocessing), the fused cubecl pipelines run 2.4-3.5× faster than VPI's resize alone (detailed breakdown in the fusion-bench summary above).
Full numbers across 6 sizes (256² → 8K) in crates/kornia-cubecl/RESULTS.md. Bit-exact correctness vs fast_image_resize NEON (max_diff = 0 on all 4 sizes tested).
What's in this prototype
- #[cube] composable primitives (inlined at codegen): sample_bilinear_u8_rgb_pixel, rgb_to_gray_u8, normalize_u8_to_f32, normalize_chan_u8_to_f32
- Standalone launchers: resize (6 variants — baseline, x4, x16, with-pre-uploaded-weights, x4_pw, pw_wide), rgb_to_gray_u8, normalize_u8_to_f32, hwc_u8_to_chw_f32_normalize
- Pre-fused pipelines: resize_to_gray_normalize_with_weights (HWC f32 gray output), resize_to_chw_normalize_with_weights (CHW f32 RGB output, ImageNet-style)
- tests/correctness.rs — passes bit-exact on cubecl-cpu
- examples/bench_min.rs — single-op size sweep, NEON vs cubecl-cpu vs cubecl-cuda
- examples/bench_fusion.rs — pipeline fusion, sequential vs fused, with side-by-side summary
- examples/vpi_bench.py — VPI baseline (Python, vpi3-python-src package)
Key learnings (full analysis in RESULTS.md)
- Fused resize + RGB→gray + normalize costs 0.98× of resize-only — the extra ops fill idle bandwidth-bound pipeline slots.
- cubecl-cpu wants x16 tiles (fewer threads, less overhead); cubecl-cuda wants x1 (more threads, max occupancy). Same kernel, different optimal launch geometry per backend.
- cudarc's fallback-latest feature defaults to CUDA 13.2 symbol bindings, which Jetson's libcuda doesn't have. Build with CUDARC_CUDA_VERSION=12060.
How the cuda numbers were unblocked
First attempt panicked:
libcuda.so: undefined symbol: cuCoredumpDeregisterCompleteCallback (a CUDA 13.2 symbol cudarc binds when nvcc isn't on PATH). Fixed by setting CUDARC_CUDA_VERSION=12060 before cargo build so cudarc binds only the CUDA 12.6 symbols matching Jetson's libcuda. Documented in RESULTS.md.
For @cjpurackal and @Incharajayaram — GSoC starting point
Hi! This is a draft prototype intended to give you a concrete, measurable baseline if you take on cubecl integration as a GSoC project. The crate is fully self-contained (sub-workspace), so you can iterate without breaking anything in the main kornia-rs build.
What's done (use as starting point): everything listed under "What's in this prototype" above, i.e. the composable #[cube] primitives, six resize variants, two pre-fused pipelines, the bit-exact correctness test, and the three benches.
Suggested directions to take it further (in roughly increasing-effort order):
- A Context<R> + Plan<R> design that hides the cubecl machinery behind a kornia-idiomatic surface (design doc not yet written; we punted that to follow-up).
- More ops: rgb_to_yuv, nearest, bicubic, lanczos, gaussian_blur, gradient as primitives + standalone launchers + pre-fused pipelines.
- Eliminate the explicit copies: end-to-end cuda time is dominated by cudaMemcpy (~70 ms at 8K) because cubecl-cuda doesn't know about Tegra's unified memory. Pinned/managed memory would eliminate this round-trip entirely.
- Vector<u8, N> SIMD on cubecl-cpu. Currently cubecl-cpu's MLIR backend emits scalar code, leaving 5-6× perf on the table vs hand-tuned NEON. Adding explicit Line<u8>/Vector<u8> to the primitives may close it.
- cubecl-fusion automatic op-graph fusion. cubecl has a fusion runtime used by Burn for tensor ops; not yet evaluated for image kernels. Could obsolete the manual fused-launcher tier if mature enough.
- Integration into kornia-imgproc: make kornia::resize() dispatch to NEON or cubecl based on a Backend enum with the same semantics on both (a possible shape is sketched below). Now possible since correctness is bit-exact.
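A purely hypothetical sketch of that Backend-dispatch direction (none of these names are from this PR or from kornia-rs; the point is a single public resize whose semantics are identical on both paths now that the outputs are bit-exact):

```rust
// Hypothetical sketch only: one public entry point, two interchangeable
// backends with identical semantics.
pub enum Backend {
    NeonCpu,    // current fast_image_resize path
    CubeclCuda, // kornia-cubecl GPU path
}

pub fn resize_u8_rgb(
    src: &[u8],
    src_dims: (usize, usize),
    dst: &mut [u8],
    dst_dims: (usize, usize),
    backend: Backend,
) {
    match backend {
        Backend::NeonCpu => { /* call the existing fast_image_resize path */ }
        Backend::CubeclCuda => { /* launch the kornia-cubecl kernel */ }
    }
    let _ = (src, src_dims, dst, dst_dims);
}
```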
Build/run gotchas you'll hit:
- The cpu feature needs the tracel-llvm 20.1.4-7 prebuilt bundle, and the upstream release mislabels the bundle's directory name (-6); we worked around by extracting and renaming. See the RESULTS.md "How we unblocked cuda on Jetson Orin" section.
- The published VPI numbers at https://docs.nvidia.com/vpi/perf_tegra234_rescale.json are NOT directly comparable to Orin Nano; we benched VPI on the same Orin Nano for an honest head-to-head.
Happy to walk through any of this on a call. The branch will stay open.
Test plan
- cargo test --lib --no-default-features --features cpu on Jetson Orin Nano
- Bit-exact (max_diff = 0) vs fast_image_resize NEON across 4 sizes
- cuda benches with CUDARC_CUDA_VERSION=12060