Description
Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery. This counts the number of times each doc in the segment matches one of the query vector's hashes. https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
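To make the counting step concrete, here is a minimal, Lucene-free sketch of the idea. It is not the code in MatchHashesAndScoreQuery: the class name `CountHitsSketch`, the `Map<String, int[]>` standing in for a segment's postings, and the `int[]` counter are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a segment: each term (hash) maps to an array of doc IDs.
class CountHitsSketch {

    // For every query hash, walk its postings list and bump a per-document counter.
    // The cost grows with (number of query hashes) x (length of each postings list).
    static int[] countHits(Map<String, int[]> postings, List<String> queryHashes, int maxDoc) {
        int[] counts = new int[maxDoc];
        for (String hash : queryHashes) {
            int[] docs = postings.get(hash);
            if (docs == null) continue; // hash not present in this segment
            for (int doc : docs) counts[doc]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("h1", new int[]{0, 2});
        postings.put("h2", new int[]{0, 1});
        postings.put("h3", new int[]{2});
        int[] counts = countHits(postings, List.of("h1", "h2", "h9"), 3);
        System.out.println(Arrays.toString(counts)); // prints [2, 1, 1]
    }
}
```

In the real query the inner loop is a term lookup plus postings iteration per hash, which is where a lookup path tuned for a handful of terms starts to hurt with tens or hundreds of hashes.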
AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).
It seems the main thing worth exploring is using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Maybe there's a way to simply wrap an existing postings format and offer some helpers that cache the expensive method calls?
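As a rough illustration of the "wrap and cache" idea, the sketch below puts a small LRU cache in front of an arbitrary postings lookup. Everything here is hypothetical: `CachedPostingsSketch`, `docsFor`, and the `Function<String, int[]>` loader are invented for the example and are not part of Lucene's PostingsFormat API. A real wrapper would also have to deal with segment lifecycles, memory limits, and thread safety.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: caches the result of an expensive hash -> doc IDs lookup.
// Not thread-safe; a real implementation would need synchronization.
class CachedPostingsSketch {
    private final Function<String, int[]> loader; // the expensive lookup being wrapped
    private final Map<String, int[]> cache;
    int loads = 0; // number of times the underlying lookup was actually invoked

    CachedPostingsSketch(Function<String, int[]> loader, int capacity) {
        this.loader = loader;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > capacity;
            }
        };
    }

    int[] docsFor(String hash) {
        int[] docs = cache.get(hash);
        if (docs == null) {
            docs = loader.apply(hash); // cache miss: fall through to the wrapped lookup
            loads++;
            cache.put(hash, docs);
        }
        return docs;
    }
}
```

Whether this helps depends on how often hashes repeat across queries; if most query hashes are distinct, the cache only adds overhead.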
Some specific numbers:
When running the MatchHashesAndScoreQueryPerformanceSuite with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports ~92% of the time spent in the countHits method.
When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports ~36% of the search thread time spent in countHits.

