
Adding KnnVectorStore to dense_vector track for supporting knn-recall operations #518

Merged
pmpailis merged 21 commits into elastic:master from pmpailis:feature/recall_score_cache
Nov 30, 2023
Merged

Adding KnnVectorStore to dense_vector track for supporting knn-recall operations#518
pmpailis merged 21 commits intoelastic:masterfrom
pmpailis:feature/recall_score_cache

Conversation

@pmpailis
Contributor

@pmpailis pmpailis commented Nov 20, 2023

In this PR we add support for a KnnVectorStore that is shared across all knn-recall-* operations within the same thespian actor. Since the true exact neighbors are the same for all knn-recall operations, it is beneficial to compute them only once and re-use them in later runs, instead of re-computing them every time while the index itself remains unchanged.

So the idea here is to add a KnnRecallProcessor that identifies the max k for the given track across all defined operations, so that we can later use this k to fetch the required true exact neighbors and cover all cases. Each runner would then consult this vector store to:

  • pick up the query vectors to use
  • fetch the true k (where k <= max_k) nearest neighbors and
  • compute overlap between result lists and report recall.
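The overlap computation in the last step can be sketched as follows (a minimal illustration with a hypothetical function name, not the track's actual code):

```python
def recall_at_k(retrieved_ids, true_neighbor_ids, k):
    """Recall@k: fraction of the true k nearest neighbors that appear
    in the result list returned by the approximate kNN search."""
    retrieved = set(retrieved_ids[:k])
    relevant = set(true_neighbor_ids[:k])
    return len(retrieved & relevant) / k
```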

A couple of notes:

  • The KnnVectorStore is initialized by the first runner on each actor (currently we only have 1 client for the dense_vector track, so effectively only 1 actor). If needed, we could move this to the KnnRecallParamSource.
  • The results are loaded lazily whenever requested. This means that we don't do any prefetching, and the first runner takes a substantial performance hit compared to the rest of the knn-recall-* runners. However, since (i) this was already the existing behaviour and (ii) we do not care about performance metrics in recall tasks, I believe it's safe to define it this way. This could also be beneficial in cases where we apply sampling and/or randomization at a per-operation level, as we wouldn't have to unnecessarily compute the true nearest neighbors for every vector in the file.
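The lazy-loading behaviour described in the second note can be sketched like this (all names are hypothetical; in the real store a cache miss would trigger an exact script_score search):

```python
class LazyNeighborCache:
    """Computes true nearest neighbors on first request and memoizes them,
    so only the first runner pays the computation cost."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn  # e.g. an exact script_score search
        self._cache = {}

    def get_neighbors(self, query_id):
        # Compute on first access only; later calls are served from memory.
        if query_id not in self._cache:
            self._cache[query_id] = self._compute_fn(query_id)
        return self._cache[query_id]
```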

Benchmarks on the master branch for the dense_vector track, from runs with and without the KnnVectorStore:

| branch | using KnnVectorStore | test mode | total runtime (seconds) |
|--------|----------------------|-----------|-------------------------|
| master | yes | yes | 83 (±2) |
| master | no | yes | 96 (±3) |
| master | yes | no | 1517 (±30) |
| master | no | no | 1949 (±53) |

Benchmarks on a custom branch for the dense_vector track with 35 recall operations overall, ranging from knn-recall-10-10 to knn-recall-1000-4000:

| branch | using KnnVectorStore | test mode | total runtime (seconds) |
|--------|----------------------|-----------|-------------------------|
| custom | yes | yes | 341 (±10) |
| custom | no | yes | 563 (±24) |
| custom | yes | no | 1851 (±93) |
| custom | no | no | 5868 (±125) |

Recall & latency metric breakdown for all operations:

| metric | using KnnVectorStore | test mode | knn-recall-10-100 | knn-recall-100-1000 | knn-recall-10-10 | knn-recall-100-100 | knn-recall-1000-1000 |
|--------|----------------------|-----------|-------------------|---------------------|------------------|--------------------|----------------------|
| 100th percentile latency | yes | yes | 24909 | 5294.24 | 2213.13 | 4307.95 | 22592.9 |
| 100th percentile latency | no | yes | 5599 | 9173 | 4227 | 8025 | 38872 |
| avg recall | yes | yes | 1 | 1 | 0.95195 | 0.99929 | 1 |
| avg recall | no | yes | 0.99995 | 1 | 0.9518 | 0.99931 | 1 |
| 100th percentile latency | yes | no | 134558 | 10754 | 2116.26 | 3922.86 | 17790.2 |
| 100th percentile latency | no | no | 114021 | 126425 | 112871 | 116608 | 141182 |
| avg recall | yes | no | 0.9836 | 0.998695 | 0.79065 | 0.95422 | 0.991223 |
| avg recall | no | no | 0.98465 | 0.998755 | 0.78765 | 0.954505 | 0.9912535 |

@pmpailis pmpailis requested review from benwtrent, jdconrad and mayya-sharipova and removed request for jdconrad November 20, 2023 12:28
@pmpailis pmpailis removed the request for review from benwtrent November 20, 2023 13:58
```json
"script_score": {
  "query": {"match_all": {}},
  "script": {
    "source": "cosineSimilarity(params.query, 'vector') + 1.0",
```
Contributor Author

@pmpailis pmpailis Nov 20, 2023


This method (and the whole KnnVectorStore) could be made a bit more generic by passing the vector field & the queries file as parameters, so it can be reused directly in other tracks (e.g. so_vector).

```python
return [hit["_id"] for hit in script_query["hits"]["hits"]]


class KnnVectorStore:
```
Contributor Author


Once this is reviewed, we can enable it for the so_vector track as well.

```python
    return []

def __repr__(self, *args, **kwargs):
    return "knn-recall-processor"
```
Contributor


I think we can assume that max_k=1000 or even max_k=10000 and simplify the code (get rid of KnnRecallProcessor). What do you think?

script_score query performance doesn't change much with size, as we need to go through all vectors regardless.

Contributor Author


Yeah, I see your point! +1 for max_k=1_000 (10_000 might be overkill for 99% of cases, and we would just end up storing potentially thousands of these lists in memory). We can default to 1_000, and if an operation requests more results, we can over-request for that specific task.
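The agreed behaviour could be sketched roughly like this (a hedged illustration; `fetch_exact` and the cache layout are assumptions, not the merged code):

```python
DEFAULT_MAX_K = 1_000

def true_neighbors(query_id, k, cache, fetch_exact):
    """Serve the true k nearest neighbors from the shared cache when k fits
    within the default, otherwise over-request once for this specific task."""
    if k <= DEFAULT_MAX_K:
        if query_id not in cache:
            # Cache DEFAULT_MAX_K neighbors up front so any smaller k is covered.
            cache[query_id] = fetch_exact(query_id, DEFAULT_MAX_K)
        return cache[query_id][:k]
    # k exceeds the cached list size: one-off exact search, no cache growth.
    return fetch_exact(query_id, k)
```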

Contributor Author


Updated.

Contributor

@mayya-sharipova mayya-sharipova left a comment


@pmpailis Great work, very nicely written!
Overall the code looks good to me, but I left a comment about simplification.

Some other comments:

  • Did you check the recall with your changes? Are the numbers the same as on the master branch?
  • Have you checked the performance of only the recall operations (with and without your changes)? (We can temporarily set "include-in-reporting": true just for tests.)
  • Would be nice if somebody from @elastic/es-perf also reviewed this code.

@b-deam b-deam requested a review from a team November 22, 2023 06:40
@pmpailis
Contributor Author

pmpailis commented Nov 22, 2023

> Did you check the recall with your changes? Are the numbers the same as on the master branch?

Yeah, recall-wise there weren't any noticeable changes - but I'll double-check and post the results here as well, to be on the safe side.

> Have you checked the performance of only the recall operations (with and without your changes)? (We can temporarily set "include-in-reporting": true just for tests.)

Will update the report and post the track's output here as well :)

> Would be nice if somebody from @elastic/es-perf also reviewed this code.

+1 - will ask in the channel for someone to also take a look at this :)

@pmpailis
Contributor Author

@mayya-sharipova updated the code to remove the KnnRecallProcessor in favor of a default max_k size of 1_000 & added the additional metrics we discussed to the description :)

Contributor

@mayya-sharipova mayya-sharipova left a comment


@pmpailis Thanks, great work! Very nice speedups of the recall operation!

```python
min_recall = k

for query in params["queries"]:
    knn_vector_store: KnnVectorStore = KnnVectorStore.get_instance(queries_file, vector_field)
```
Member

@b-deam b-deam Nov 23, 2023


We kind of discussed this already via Slack, but I think you'd be better off moving this KnnVectorStore initialisation out of the KnnRecallRunner and into the KnnRecallParamSource constructor. That way we cache the file upfront, before we even start issuing requests, by passing the KnnVectorStore as part of the runner's param arg.

I know you mentioned that the latency of these requests isn't a big concern, so I'll leave it up to you to decide given this is your workload, but in general we try to avoid adding anything blocking/slow to a runner besides the actual operation we're trying to measure.

Contributor Author


Tbh I kept it in the KnnRecallRunner because, in addition to us being interested only in recall, IIUC (based on the logs) initializing KnnRecallParamSource happens a number of times throughout the track's execution (e.g. once for each recall operation), so I tried to avoid the additional I/O. We do however also get the plus of caching the knn vector store, as this happens only on two separate actors. E.g. logs from adding the KnnVectorStore to the constructor of KnnRecallParamSource with the lru cache:

```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 15:56:52,129 ActorAddr-(T|:59518)/PID:70822 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 15:57:00,732 ActorAddr-(T|:59571)/PID:70835 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```

and without the lru cache:

```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 16:39:10,13 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,52 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,89 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,126 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,163 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:18,794 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:41,863 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:06,430 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:27,997 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:51,577 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```

Would it make sense to add the params in the KnnRecallParamSource constructor and then read the file "lazily" when actually computing the params that are passed to the runner? This would lead to the file being read only once instead of once per operation. I've made some changes that address this, so whenever you have some time could you please give it another 👀? 🙏
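The per-actor caching visible in the logs above can be sketched with functools.lru_cache on a factory function (an illustrative sketch with hypothetical structure, not the PR's exact code; the instance counter exists only to demonstrate the caching):

```python
from functools import lru_cache


class KnnVectorStore:
    instances_created = 0  # only for illustrating the caching behaviour

    def __init__(self, queries_file, vector_field):
        KnnVectorStore.instances_created += 1
        self.queries_file = queries_file
        self.vector_field = vector_field


@lru_cache(maxsize=None)
def get_instance(queries_file, vector_field):
    # Memoized on (queries_file, vector_field): within one actor process the
    # queries file is read once, no matter how many param sources are built.
    return KnnVectorStore(queries_file, vector_field)
```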

Member


Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):
https://github.com/elastic/rally/blob/c1f04a368f5efaa28128dc5341b7082a62a13a34/esrally/driver/driver.py#L1792-L1812

In any case this is your workload/track, so if this implementation makes sense then I have no problems with it, just wanted to point out the critical path bit.

Contributor Author

@pmpailis pmpailis Nov 27, 2023


> Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):

IIUC, doesn't this codepath execute independently for each operation that the actor is assigned, since we initialize a new AsyncIoAdapter for each task allocation - and the params_per_task are independent for each run (which makes sense, since we have different params per operation)? 🤔 Maybe I'm missing something, but do we already account somehow for sharing common resources across similar operations?

In addition, we also seem to create a KnnRecallParamSource when preparing the track and checking for used corpora, which (for this specific case) also introduces unnecessary I/O load.

(Note: I think that the same holds for KnnParamSource as well)

Member


> IIUC doesn't this codepath execute independently for each operation that the actor is assigned to, as we initialize a new AsyncIoAdapter for each task allocation

Yes, sorry, I glossed over that part; I should have said "once per worker, per task".

> and the params_per_task are independent for each run (which makes sense since we have different params per operation)? 🤔 Maybe I'm missing something but - do we already account somehow for having commonly shared resources for similar operations?

The AsyncExecutor is the part that actually executes the runner, and it gets its per-request parameters from the respective ScheduleHandle.

I took a deeper look at the code, but I lack almost all of the domain knowledge required to give good advice here. I'll leave it up to you how you want to handle it, i.e. you don't need my approval to merge 😆 as these are just general comments.

@benwtrent
Member

Why aren't we statically initializing a list of nearest neighbors?

We already know the dataset and the queries.

@pmpailis
Contributor Author

> Why aren't we statically initializing a list of nearest neighbors?
>
> We already know the dataset and the queries.

To have a static list of vectors we would have to either hash them to a unique value or use custom IDs during indexing, so that we can keep the references, right? There are a couple of reasons I opted for a dynamically computed list:

  • the execution cost is not prohibitive as it is something that we already do multiple times
  • to avoid potential sync/misconfiguration issues with ids / hashes / similarity metrics etc., and make it easier for someone to add a new track with recall metrics enabled
  • to make it easier to support a random set of queries from the dataset (or true random) and potential different mappings if we decide to do so - instead of having a true-neighbors file for each combination

Not something of immediate "necessity", but it will make it a bit easier for users to add & adjust to their needs. Maybe we could also add an option to the recall operations to specify whether the true neighbors are computed on the fly or read from an external resource - so that we keep the flexibility where needed, while also addressing the performance cost for huge datasets. WDYT?
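The suggested option could be wired roughly like this (entirely hypothetical parameter names, sketched only to show the idea of choosing between on-the-fly computation and a precomputed file):

```python
import json


def load_true_neighbors(operation_params, compute_fn):
    """Pick the source of true neighbors for a recall operation: compute them
    on the fly (the default), or read a precomputed list from an external file."""
    source = operation_params.get("true-neighbors-source", "computed")
    if source == "computed":
        return compute_fn()
    # Precomputed path: a JSON file with one neighbor-id list per query.
    with open(operation_params["true-neighbors-file"]) as f:
        return json.load(f)
```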

@benwtrent
Member

I think this is ok @pmpailis, but it does seem too complicated. But as long as it works :)

@mayya-sharipova
Contributor

I like @pmpailis's current approach.
I think once Rally adds the ability to index docs with custom IDs, we can provide a static list of expected IDs, but for now the current approach seems best.

@pmpailis
Contributor Author

@benwtrent if you're also ok with this - we could merge this and if needed we can track the work for offline true neighbors in a separate PR/issue.

@benwtrent
Member

I am fine with the PR as long as @b-deam approves.

Member

@b-deam b-deam left a comment


Rubber stamped!

@pmpailis pmpailis merged commit 8325625 into elastic:master Nov 30, 2023
@pmpailis pmpailis mentioned this pull request Nov 30, 2023
@pmpailis
Contributor Author

pmpailis commented Nov 30, 2023

It seems that a number of PRs were closed when this one was merged 😞

#340
#353
#376
#395
#440
#459
#464
#500
#505
#512
#520
#522
#525
#527

@mayya-sharipova
Contributor

> It seems that a number of PRs were closed when this one was merged 😞

It's so strange how the automation misbehaved.

@gbanasiak
Contributor

The root cause of #518 (comment) was addressed, and affected PRs re-opened.
