[Bug] MOS optimistic search bug fix on nested Cagra index.#3155
[Bug] MOS optimistic search bug fix on nested Cagra index.#3155Vikasht34 merged 5 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
| * iterations is less than 2, preserving O(1) amortized time. | ||
| * </p> | ||
| */ | ||
| public final class RobustUniqueRandomIterator { |
There was a problem hiding this comment.
LCG based random number generation is almost four times faster than HashSet based deduplication.
Also it does not require additional memory.
HashSet based micro performance
Min: 541 nano
50%: 1959 nano
90%: 3458 nano
95%: 3750 nano
99%: 5291 nano
Max: 43125 nano
LCG based micro performance
Min: 125 nano
50%: 333 nano
90%: 458 nano
95%: 1417 nano
99%: 1625 nano
Max: 12458 nano
Signed-off-by: Navneet Verma <navneev@amazon.com> Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
ca92b6e to
e1efe48
Compare
| final KnnCollectorManager collectorManager = knnQuery.getParentsFilter() == null | ||
| ? new TopKnnCollectorManager(k, indexSearcher) | ||
| : new DiversifyingNearestChildrenKnnCollectorManager(k, knnQuery.getParentsFilter(), indexSearcher); |
There was a problem hiding this comment.
[Clarification needed]
I am little surprised that this is not just related to remote index build it can be for other search path too right?
There was a problem hiding this comment.
Unfortunately, yes. Not only for the Cagra case this was affecting all nested case.
There was a problem hiding this comment.
Thank you. Now it makes more sense to me. :)
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3155 +/- ##
============================================
+ Coverage 82.45% 82.48% +0.02%
- Complexity 3856 3867 +11
============================================
Files 417 418 +1
Lines 14423 14449 +26
Branches 1841 1847 +6
============================================
+ Hits 11893 11918 +25
- Misses 1771 1773 +2
+ Partials 759 758 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
Problems
We have three problems.
2.1. The seed doc IDs from the second deep dive are not honored. Instead, random entry points are still used.
2.2. Nested index search uses TopKnnCollectorManager instead of DiversifyingNearestChildrenKnnCollectorManager.
Unique Random Entry Points Generation
The fix should be straightforward.
A naive implementation would use a Set to remove duplicates, but we can instead use a random sampling algorithm with O(1) memory.
Second Deep-Dive Search for Nested Index
The memory-optimized search uses an optimistic search strategy:
This second phase is called the second deep-dive search.
The doc IDs collected during the first phase must be used as entry points. Otherwise, the purpose of the second deep dive is defeated.
However, for CAGRA, we currently use random entry points regardless.
Fix
If the KnnCollector's search strategy is KnnSearchStrategy.Seeded, we must honor it and ensure that the search starts from the provided entry points.
Use DiversifyingNearestChildrenKnnCollectorManager
For a nested index, we track the best parent score, not individual child vectors.
For example:
This behavior is implemented by DiversifyingNearestChildrenKnnCollectorManager.
However, we currently use TopKnnCollectorManager for nested indices.
Fix: Select Collector Manager Based on Index Type
We should create the collector manager depending on whether the index is nested.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.