Optimize top-k counting for approximate queries #160

@alexklibisz

Description

Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery, which counts the number of times each doc in the segment matches one of the query vector's hashes: https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
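For context, the hot loop amounts to something like the following simplified, dependency-free model. The real implementation walks Lucene's TermsEnum/PostingsEnum; the names and data shapes here are invented for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the countHits hot loop: for each query hash that
// matched a term in the segment, walk its postings list (sorted doc IDs)
// and increment a per-document counter. The real code obtains these lists
// via TermsEnum.seekExact and PostingsEnum; this model just uses int arrays.
class CountHitsModel {
    static short[] countHits(int maxDoc, List<int[]> postingsPerMatchedHash) {
        short[] counts = new short[maxDoc]; // one counter per doc in the segment
        for (int[] postings : postingsPerMatchedHash) {
            for (int doc : postings) {
                counts[doc]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two query hashes matched; doc 2 appears in both postings lists.
        List<int[]> postings = List.of(new int[]{0, 2, 3}, new int[]{2, 4});
        System.out.println(Arrays.toString(countHits(5, postings)));
        // → [1, 0, 2, 1, 1]
    }
}
```

The cost scales with the total number of (term, doc) pairs touched, which is why a query with hundreds of hash terms is so much heavier than a typical keyword query.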

AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).

The main thing worth exploring seems to be using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Or a way to wrap an existing postings format and offer some helpers that cache the expensive method calls?
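The "wrap and cache" idea could look roughly like the sketch below. Every name here is hypothetical, not an existing Lucene or elastiknn API; it only illustrates memoizing the result of an expensive per-term lookup so that repeated hash terms skip the terms-dictionary seek.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: memoize the doc IDs returned by an expensive
// per-term seek, so repeated hash terms hit the cache instead of the
// terms dictionary. None of these names come from Lucene or elastiknn.
class CachingTermLookup {
    private final Map<String, int[]> cache = new HashMap<>();
    private final Function<String, int[]> seek; // stands in for the expensive seekExact + postings walk

    CachingTermLookup(Function<String, int[]> seek) {
        this.seek = seek;
    }

    int[] postings(String hashTerm) {
        // Invoke the expensive lookup at most once per distinct term.
        return cache.computeIfAbsent(hashTerm, seek);
    }
}
```

Whether this pays off depends on how often hash terms repeat across queries, and a real version would need an eviction policy and per-segment invalidation when the index changes.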

Some specific numbers:

When running the MatchHashesAndScoreQueryPerformanceSuite with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports ~92% of the search time spent in the countHits method:

[VisualVM sampler screenshot]

When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports ~36% of the search thread time spent in countHits:

[VisualVM sampler screenshot]
