Implement CUDA multipass for knn > GPU_MAX_SELECTION_K #7381

Merged
ssheorey merged 5 commits into isl-org:main from nicolaloi:nicolaloi/cuda-multipass-knn on Mar 10, 2026

@nicolaloi nicolaloi commented Dec 8, 2025

Type

Motivation and Context

The GPU KNN search fails silently when k exceeds the GPU_MAX_SELECTION_K macro and returns garbage output (all 0s, indices larger than the total number of points, or even negative indices). GPU_MAX_SELECTION_K is 2048 if CUDA_VERSION > 9000, otherwise 1024. The CPU KNN search has no such limit. To support larger k on GPU without changing GPU_MAX_SELECTION_K, a multipass algorithm should be implemented that splits the KNN search into batches, each of size at most GPU_MAX_SELECTION_K.

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

I have implemented a multipass algorithm to find large KNN on CUDA, splitting the search into multiple batches, each no larger than GPU_MAX_SELECTION_K. The main challenge is masking the indices already found in earlier passes/iterations while handling tiling and non-contiguous memory correctly.
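As a rough CPU illustration of the idea above (not the CUDA code in this PR), the sketch below selects at most `max_selection_k` neighbors per pass and masks the indices already returned so later passes skip them. The names `knn_multipass`, `squared_l2`, and `max_selection_k` are hypothetical, and clamping `knn` to the number of points is an assumption made only for this example.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_multipass(points, query, knn, max_selection_k):
    """Return indices of the nearest points, nearest first.

    Each pass selects at most max_selection_k neighbors. Indices found in
    earlier passes are masked so they cannot be selected again.
    """
    knn = min(knn, len(points))  # assumption: clamp to the cloud size
    dists = [squared_l2(p, query) for p in points]
    masked = [False] * len(points)
    result = []
    while len(result) < knn:
        batch = min(max_selection_k, knn - len(result))
        # Only points not yet selected are candidates in this pass.
        candidates = ((d, i) for i, d in enumerate(dists) if not masked[i])
        for _, i in heapq.nsmallest(batch, candidates):
            masked[i] = True
            result.append(i)
    return result
```

Because every pass takes the smallest remaining distances, concatenating the passes yields the same ordering a single unrestricted selection would give.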

To improve readability, I have separated the function into two distinct functions, depending on whether or not the multipass algorithm should be used:

if (knn <= GPU_MAX_SELECTION_K) {
    KnnSearchCUDASinglePass<T, TIndex>(points, queries, knn, tile_rows,
                                       tile_cols, output_allocator,
                                       point_norms, query_norms);
} else {
    KnnSearchCUDAMultiPass<T, TIndex>(points, queries, knn, tile_rows,
                                      tile_cols, output_allocator,
                                      point_norms, query_norms);
}
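
To make the batch arithmetic behind this dispatch concrete, here is a small sketch of how a requested knn could be split into per-pass selection sizes. `plan_passes` is a hypothetical helper, not a function from this PR, and the actual kernel may size its passes differently.

```python
def plan_passes(knn, max_selection_k):
    """Split knn into per-pass selection sizes, each at most max_selection_k."""
    if knn <= 0 or max_selection_k <= 0:
        raise ValueError("knn and max_selection_k must be positive")
    full_passes, remainder = divmod(knn, max_selection_k)
    sizes = [max_selection_k] * full_passes
    if remainder:
        sizes.append(remainder)
    return sizes
```

With a limit of 2048, `plan_passes(2048, 2048)` gives a single pass, which matches the single-pass branch, while `plan_passes(5000, 2048)` gives three passes of 2048, 2048, and 904.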

I have written a script with 120 test cases covering small and large clouds (up to 2 million points), multiple queries, and small to very large knn values (up to 50000). This PR passes all of them, while the original main branch code does not: cuda_knn_test.py
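
For reference, the failure modes listed in the motivation (all 0s, indices beyond the number of points, negative indices) can be caught with a few sanity checks on the returned indices. The sketch below is only an illustration of such checks; `check_knn_indices` is a hypothetical helper and is not taken from cuda_knn_test.py.

```python
def check_knn_indices(indices, num_points, knn):
    """Return a list of problems found in one query's KNN indices (empty if none)."""
    problems = []
    expected = min(knn, num_points)
    if len(indices) != expected:
        problems.append("expected %d indices, got %d" % (expected, len(indices)))
    # Catches negative indices and indices beyond the point cloud.
    if any(i < 0 or i >= num_points for i in indices):
        problems.append("index out of range")
    # Catches all-zero output (for knn > 1) and any neighbor returned twice.
    if len(set(indices)) != len(indices):
        problems.append("duplicate indices")
    return problems
```

A fuller test would also compare the indices and distances against a brute-force or CPU result.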

update-docs bot commented Dec 8, 2025

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@nicolaloi nicolaloi changed the title from "CUDA multipass for knn >= GPU_MAX_SELECTION_K" to "CUDA multipass for knn > GPU_MAX_SELECTION_K" Dec 8, 2025
@nicolaloi nicolaloi changed the title from "CUDA multipass for knn > GPU_MAX_SELECTION_K" to "Implement CUDA multipass for knn > GPU_MAX_SELECTION_K" Dec 8, 2025

Copilot AI left a comment

Pull request overview

This PR implements a multi-pass algorithm for CUDA KNN search to handle k values larger than GPU_MAX_SELECTION_K (1024 or 2048 depending on CUDA version). Previously, KNN searches with k > GPU_MAX_SELECTION_K would silently fail and produce incorrect results. The implementation splits large KNN searches into batches, using a bitmask to track already-selected neighbors across passes.

Key changes:

  • Added multi-pass algorithm with masking to handle k > GPU_MAX_SELECTION_K
  • Split the optimized KNN search into separate single-pass and multi-pass functions
  • Fixed memory stride handling for non-contiguous tensor views in L2Select kernel
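
To illustrate why strides matter for a non-contiguous view, here is a minimal sketch, independent of the actual L2Select kernel. It reads a tile of a row-major matrix through explicit strides; `read_view` and the example buffer are hypothetical.

```python
def read_view(buffer, offset, row_stride, col_stride, rows, cols):
    """Read a rows x cols view from a flat buffer using explicit strides."""
    return [
        [buffer[offset + r * row_stride + c * col_stride] for c in range(cols)]
        for r in range(rows)
    ]


# A 3 x 4 row-major matrix stored flat: element (r, c) is at index r * 4 + c.
buffer = list(range(12))

# Tile made of columns 1..2 of every row. The tile is 2 columns wide, but
# consecutive rows are still 4 elements apart in memory.
correct = read_view(buffer, offset=1, row_stride=4, col_stride=1, rows=3, cols=2)
# correct == [[1, 2], [5, 6], [9, 10]]

# Assuming the tile is contiguous (row stride equal to the tile width) reads
# the wrong elements.
wrong = read_view(buffer, offset=1, row_stride=2, col_stride=1, rows=3, cols=2)
# wrong == [[1, 2], [3, 4], [5, 6]]
```

Passing the strides of the underlying tensor rather than deriving them from the tile shape avoids this kind of misread.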

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| cpp/open3d/core/nns/KnnSearchOps.cu | Implements multi-pass KNN algorithm with masking kernels, separates single-pass and multi-pass logic, and fixes early return initialization |
| cpp/open3d/core/nns/kernel/L2Select.cuh | Adds stride parameters to handle non-contiguous tensor views correctly in distance calculations |
| CHANGELOG.md | Documents the new multi-pass KNN feature |

nicolaloi commented Jan 17, 2026

@ssheorey It seems that the test failures are unrelated to this PR.

ssheorey (Member) commented

@nicolaloi thanks for finding and fixing this issue. The PR looks good to me. Can you add one representative test case (that exercises the multipass code) from cuda_knn_test.py to the nn test suite here:

python/test/core/test_nn.py

@ssheorey ssheorey added the status / to merge Looks good, merge after minor updates. label Feb 24, 2026
nicolaloi (Contributor, Author) commented

Ok, I'll add it in the next few days.

@ssheorey ssheorey added the status / needs info Waiting for information from reporter / author label Feb 25, 2026
@ssheorey ssheorey added this to the v0.20 milestone Mar 4, 2026
nicolaloi (Contributor, Author) commented

@ssheorey the test I have added fails with the main branch but passes with this PR branch.

@nicolaloi nicolaloi force-pushed the nicolaloi/cuda-multipass-knn branch from c91dd75 to 87afdf8 Compare March 5, 2026 23:12
@nicolaloi nicolaloi force-pushed the nicolaloi/cuda-multipass-knn branch from 87afdf8 to 3ac92b1 Compare March 5, 2026 23:15
@ssheorey ssheorey removed the status / needs info Waiting for information from reporter / author label Mar 6, 2026
@ssheorey ssheorey merged commit 7dd7bc1 into isl-org:main Mar 10, 2026
27 of 28 checks passed

Labels

status / to merge Looks good, merge after minor updates.

Development

Successfully merging this pull request may close these issues.

knn_search abnormal behavior when knn > 2048 using GPU, return all 0 or very large random integer array
