Implementing Optical flow LK#755
Conversation
|
@cjpurackal @sidd-27 Please review |
|
Hello , I've run the official benchmarks from Hardware/Environment Environment: Local CPU (Apple Silicon/Mac)
|
Can you try optimizing it more? the performance is same as before when it was run without scharr kernels |
- Fix massive threading overhead in scharr spatial gradient by replacing inner per-pixel with - Speed up both scharr and sobel 3x3 convolutions by extracting an interior fast-path that skips expensive per-pixel bounds checking bounds - Runs Formatting C++ files... Done! to ensure formatting passes CI
|
You are completely right that swapping Sobel for Scharr didn't naturally change performance. Since they are both mathematically similar 3x3 convolutions, their computational cost is identical! However, while profiling this based on your comment, I discovered two major pre-existing bottlenecks in the original spatial gradient code: It was heavily doing .min().max() bounds-checking on every single pixel during the 3x3 convolution loop. Local benchmarks show this improved overall optical flow throughput by ~15% up to 20% across workloads compared to the original code without Scharr kernels! Benchmark Results (New Optimized Scharr vs Original Unoptimized Sobel): OpticalFlowPyrLK/kornia_cpu/points/256x256/50 OpticalFlowPyrLK/kornia_cpu/points/256x256/100 OpticalFlowPyrLK/kornia_cpu/points/256x256/200 OpticalFlowPyrLK/kornia_cpu/points/256x256/500 |
There was a problem hiding this comment.
Pull request overview
Adds a new sparse pyramidal Lucas–Kanade (PyrLK) optical flow implementation to kornia-imgproc, exposing a public API (including a precompute path) and providing benchmarks/tests to validate correctness and performance.
Changes:
- Introduces
optical_flow_pyr_lkmodule with PyrLK parameters/results, precompute builder, and main optical flow entrypoints. - Exposes the new module from
crates/kornia-imgproc/src/lib.rsand adds a Criterion benchmark for optical flow. - Updates bench configuration and refreshes
Cargo.lock(including broad workspace version churn).
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/kornia-imgproc/src/optical_flow_pyr_lk.rs | New PyrLK implementation, precompute API, and unit tests. |
| crates/kornia-imgproc/src/lib.rs | Exports the new optical_flow_pyr_lk module. |
| crates/kornia-imgproc/benches/bench_optical_flow.rs | Adds Criterion benchmarks (plus optional OpenCV comparison via feature). |
| crates/kornia-imgproc/Cargo.toml | Registers the new benchmark target. |
| Cargo.lock | Updates lockfile with wide-ranging workspace/dependency version changes. |
| pub fn build_lk_precomputed<A: ImageAllocator>( | ||
| prev_img: &Image<f32, 1, A>, | ||
| next_img: &Image<f32, 1, A>, | ||
| max_level: usize, | ||
| ) -> Result<PyrLKPrecomputed<A>, PyrLKError> { | ||
| let mut prev_pyr = Vec::with_capacity(max_level + 1); | ||
| let mut next_pyr = Vec::with_capacity(max_level + 1); | ||
| prev_pyr.push(prev_img.clone()); | ||
| next_pyr.push(next_img.clone()); | ||
|
|
There was a problem hiding this comment.
build_lk_precomputed is a public function but doesn’t validate that prev_img.size() == next_img.size(). If sizes differ, the function will currently fail later with a lower-level ImageError::InvalidImageSize, which is less actionable than the dedicated PyrLKError::ImageSizeMismatch. Consider adding an upfront size check and returning PyrLKError::ImageSizeMismatch for consistency with calc_optical_flow_pyr_lk.
| } | ||
|
|
||
| for lvl in (0..=params.max_level).rev() { | ||
| let scale = 1.0 / (1 << lvl) as f32; |
There was a problem hiding this comment.
scale is computed via (1 << lvl) as f32. For large max_level values, 1 << lvl can overflow/wrap (and can panic in debug builds), leading to incorrect scaling or division by zero/inf. Consider using 2.0_f32.powi(lvl as i32) (or checked_shl with a validated max_level upper bound) to make the computation robust for all user inputs.
| let scale = 1.0 / (1 << lvl) as f32; | |
| let scale = 1.0 / 2.0_f32.powi(lvl as i32); |
| if params.win_size != WIN_SIZE { | ||
| return (pt, 0, 0.0); | ||
| } |
There was a problem hiding this comment.
track_feature is hard-coded to a 21x21 window (WIN_SIZE/HALF_WIN) and silently returns status=0 when params.win_size != 21. This conflicts with the public API contract (any positive odd win_size is accepted) and makes callers think tracking failed rather than that parameters are unsupported. Either implement runtime-sized windows (derive half_win/win_pixels from params.win_size) or return a PyrLKError::InvalidWindowSize error instead of an Ok result with status=0.
| if params.win_size != WIN_SIZE { | |
| return (pt, 0, 0.0); | |
| } |
| let mut dx = initial_flow.map_or(0.0, |d| d[0]); | ||
| let mut dy = initial_flow.map_or(0.0, |d| d[1]); | ||
| let mut valid = true; | ||
| let mut tracking_error = 0.0f32; | ||
| if params.win_size != WIN_SIZE { | ||
| return (pt, 0, 0.0); | ||
| } | ||
|
|
||
| for lvl in (0..=params.max_level).rev() { | ||
| let scale = 1.0 / (1 << lvl) as f32; | ||
| let xc = pt[0] * scale; | ||
| let yc = pt[1] * scale; | ||
|
|
||
| if lvl < params.max_level { | ||
| dx *= 2.0; | ||
| dy *= 2.0; | ||
| } |
There was a problem hiding this comment.
When use_initial_flow is enabled, the initial (dx,dy) is computed in full-resolution pixel units, but track_feature applies it directly at the coarsest pyramid level (lvl = max_level) without scaling by 1/(2^max_level). In a pyramidal LK solver the initial flow should be scaled to the current level (and then doubled when moving to finer levels), otherwise the coarse-level search starts from a grossly incorrect offset.
| grad_y_levels: precomputed.grad_y_pyr.len(), | ||
| }); | ||
| } | ||
|
|
There was a problem hiding this comment.
calc_optical_flow_pyr_lk_with_precomputed only validates the number of pyramid levels. Since this is a public API that can be called with arbitrary PyrLKPrecomputed, it should also validate (at least) that prev/next/grad pyramids have matching per-level sizes, and that params.win_size is a supported/valid value (odd, >0, and consistent with the internal window implementation). Without these checks, malformed precomputed inputs can lead to out-of-bounds indexing/panics during sampling.
| // Validate that per-level image sizes match across the precomputed pyramids. | |
| for level in 0..expected_levels { | |
| let prev_size: ImageSize = precomputed.prev_pyr[level].size(); | |
| if precomputed.next_pyr[level].size() != prev_size | |
| || precomputed.grad_x_pyr[level].size() != prev_size | |
| || precomputed.grad_y_pyr[level].size() != prev_size | |
| { | |
| // Reuse the existing error variant to signal invalid precomputed data. | |
| return Err(PyrLKError::InvalidPrecomputedLevels { | |
| expected_levels, | |
| prev_levels: precomputed.prev_pyr.len(), | |
| next_levels: precomputed.next_pyr.len(), | |
| grad_x_levels: precomputed.grad_x_pyr.len(), | |
| grad_y_levels: precomputed.grad_y_pyr.len(), | |
| }); | |
| } | |
| } | |
| // Validate window size: must be positive and odd to match the internal kernel assumptions. | |
| if params.win_size <= 0 || params.win_size % 2 == 0 { | |
| return Err(PyrLKError::InvalidWindowSize { | |
| win_size: params.win_size, | |
| }); | |
| } |
| let mut prev_pyr = Vec::with_capacity(max_level + 1); | ||
| let mut next_pyr = Vec::with_capacity(max_level + 1); | ||
| prev_pyr.push(prev_img.clone()); | ||
| next_pyr.push(next_img.clone()); | ||
|
|
There was a problem hiding this comment.
build_lk_precomputed pushes prev_img.clone() / next_img.clone() into the pyramids. In this codebase, Image cloning performs a deep copy of the underlying tensor storage, so this duplicates the full input images in memory before even building lower pyramid levels. Consider either taking ownership of the input images (so level-0 can be moved instead of cloned), or documenting this cost clearly to avoid surprising allocations for large images.
|
@Incharajayaram @namanguptagit LGTM, address the sensible copilot comments. Also can you try this on: Curious to see how it performs. |
Sure @cjpurackal will do it shortly and share the benchmarks once its done |
I have addressed the copilot comments in this pr : Incharajayaram#1 |
| } | ||
|
|
||
| #[inline] | ||
| fn bilinear_sample_from_slice(data: &[f32], width: usize, height: usize, x: f32, y: f32) -> f32 { |
There was a problem hiding this comment.
there are internal routines for that
| let mut valid = true; | ||
| let mut tracking_error = 0.0f32; | ||
| if params.win_size != WIN_SIZE { | ||
| return (pt, 0, 0.0); |
| && !(xc >= hw && yc >= hw && xc < (width as f32 - hw) && yc < (height as f32 - hw)) | ||
| { | ||
| valid = false; | ||
| break; |
There was a problem hiding this comment.
this should return something more meaningful
| match params.border_mode { | ||
| BorderMode::Clamp => { | ||
| let mut idx = 0usize; | ||
| for wy in -HALF_WIN..=HALF_WIN { |
There was a problem hiding this comment.
too much boilerplate code;block codes are practically the same below
|
@namanguptagit thanks for showing the 256x256 results, can you also try running benchmarks for 512x512? with 800 points? |
|
Sure will share results for that as well |
|
OpticalFlowPyrLK/kornia_cpu/points/512x512/800 |
Incy optical flow
|
@namanguptagit nice, thats a huge improvement |
|
please start reporting visual results of this, not only tests numbers. we need photometric and reprojection errors |
|
For real images I'd use real outdoor datasets eg for VO that usually have more textures |
|
I tested on the EuRoC Machine Hall dataset (MH_01_easy link), and extended to all Machine Hall sequences. Configuration: EuRoC Machine Hall Optical Flow Validation (step=1, max_pairs=200)
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
You can also share your feedback on Copilot code review. Take the survey.
| pub fn build_pyramid<const C: usize>( | ||
| src: &Image<f32, C, kornia_tensor::CpuAllocator>, | ||
| max_level: usize, | ||
| ) -> Result<Vec<Image<f32, C, kornia_tensor::CpuAllocator>>, ImageError> { | ||
| let mut pyramid = Vec::with_capacity(max_level + 1); | ||
| pyramid.push(src.clone()); | ||
|
|
||
| for l in 0..max_level { | ||
| let current_img = &pyramid[l]; | ||
| let new_size = ImageSize { | ||
| width: current_img.cols().div_ceil(2), | ||
| height: current_img.rows().div_ceil(2), | ||
| }; | ||
| let mut downsampled = Image::from_size_val(new_size, 0.0, kornia_tensor::CpuAllocator)?; |
| pub fn scharr_kernel_1d(kernel_size: usize) -> (Vec<f32>, Vec<f32>) { | ||
| let (kernel_x, kernel_y) = match kernel_size { | ||
| 3 => (vec![-1.0, 0.0, 1.0], vec![3.0, 10.0, 3.0]), | ||
| _ => panic!("Invalid kernel size for scharr kernel"), | ||
| }; | ||
| (kernel_x, kernel_y) |
| pub fn scharr<const C: usize, A1: ImageAllocator, A2: ImageAllocator>( | ||
| src: &Image<f32, C, A1>, | ||
| dst: &mut Image<f32, C, A2>, | ||
| kernel_size: usize, | ||
| ) -> Result<(), ImageError> { | ||
| let (kernel_x, kernel_y) = kernels::scharr_kernel_1d(kernel_size); | ||
|
|
||
| let mut gx = Image::<f32, C, _>::from_size_val(src.size(), 0.0, CpuAllocator)?; | ||
| separable_filter(src, &mut gx, &kernel_x, &kernel_y)?; | ||
|
|
| /// Computer scharr filter | ||
| /// | ||
| /// # Arguments |
| if params.use_initial_flow | ||
| && next_pts_in | ||
| .as_ref() | ||
| .is_some_and(|pts| pts.len() != prev_pts.len()) | ||
| { | ||
| return Err(PyrLKError::InitialFlowLengthMismatch { | ||
| expected: prev_pts.len(), | ||
| provided: next_pts_in.as_ref().map_or(0, |pts| pts.len()), | ||
| }); |
| if params.use_initial_flow | ||
| && next_pts_in | ||
| .as_ref() | ||
| .is_some_and(|pts| pts.len() != prev_pts.len()) | ||
| { | ||
| return Err(PyrLKError::InitialFlowLengthMismatch { | ||
| expected: prev_pts.len(), | ||
| provided: next_pts_in.as_ref().map_or(0, |pts| pts.len()), | ||
| }); | ||
| } |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs within 7 days. Thank you for your contributions! |
|
This pull request has been automatically closed due to inactivity. Feel free to reopen it if you would like to continue working on it. |
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
edgarriba
left a comment
There was a problem hiding this comment.
We should add benchmarks against opencv and vpi and match/improve using neon/avx
Yes |


📝 Description
implements #46
Important:
🛠️ Changes Made
🧪 How Was This Tested?
Ran LK benchmark with OpenCV feature and verified benchmark groups execute.
Compared Rust full path, Rust precomputed path, and OpenCV on identical benchmark setup.
Benchmarking snippets:
Rust (full) median: 16.145 ms
Rust (precomputed) median: 6.7239 ms
OpenCV median: 7.0739 ms
Rust precomputed is slightly lower median than OpenCV in this run, but OpenCV variance is high.
🕵️ AI Usage Disclosure
Check one of the following:
🚦 Checklist
💭 Additional Context
Add any other context or screenshots about the pull request here.