Optimize top-k counting for approximate queries #160

@alexklibisz

Description

Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery, which counts the number of times each doc in the segment matches one of the query vector's hashes: https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
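For context, the hot loop amounts to something like the following simplified, dependency-free model. The real implementation walks Lucene's TermsEnum/PostingsEnum; the names and data shapes here are invented for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the countHits hot loop: for each query hash that
// matched a term in the segment, walk its postings list (sorted doc IDs)
// and increment a per-document counter. The real code obtains these lists
// via TermsEnum.seekExact and PostingsEnum; this model just uses int arrays.
class CountHitsModel {
    static short[] countHits(int maxDoc, List<int[]> postingsPerMatchedHash) {
        short[] counts = new short[maxDoc]; // one counter per doc in the segment
        for (int[] postings : postingsPerMatchedHash) {
            for (int doc : postings) {
                counts[doc]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two query hashes matched; doc 2 appears in both postings lists.
        List<int[]> postings = List.of(new int[]{0, 2, 3}, new int[]{2, 4});
        System.out.println(Arrays.toString(countHits(5, postings)));
        // → [1, 0, 2, 1, 1]
    }
}
```

The cost scales with the total number of (term, doc) pairs touched, which is why a query with hundreds of hash terms is so much heavier than a typical keyword query.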

AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).

The main thing worth exploring seems to be using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Or a way to wrap an existing postings format and offer some helpers that cache the expensive method calls?
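The "wrap and cache" idea could look roughly like the sketch below. Every name here is hypothetical, not an existing Lucene or elastiknn API; it only illustrates memoizing the result of an expensive per-term lookup so that repeated hash terms skip the terms-dictionary seek.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: memoize the doc IDs returned by an expensive
// per-term seek, so repeated hash terms hit the cache instead of the
// terms dictionary. None of these names come from Lucene or elastiknn.
class CachingTermLookup {
    private final Map<String, int[]> cache = new HashMap<>();
    private final Function<String, int[]> seek; // stands in for the expensive seekExact + postings walk

    CachingTermLookup(Function<String, int[]> seek) {
        this.seek = seek;
    }

    int[] postings(String hashTerm) {
        // Invoke the expensive lookup at most once per distinct term.
        return cache.computeIfAbsent(hashTerm, seek);
    }
}
```

Whether this pays off depends on how often hash terms repeat across queries, and a real version would need an eviction policy and per-segment invalidation when the index changes.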

Some specific numbers:

When running the MatchHashesAndScoreQueryPerformanceSuite with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports ~92% of the search time spent in the countHits method:

[VisualVM sampler screenshot]

When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports ~36% of the search thread time spent in countHits:

[VisualVM sampler screenshot]
