Adding KnnVectorStore to dense_vector track for supporting knn-recall operations #518
pmpailis merged 21 commits into elastic:master from pmpailis:feature/recall_score_cache
Conversation
dense_vector/track.py (Outdated)
```python
"script_score": {
    "query": {"match_all": {}},
    "script": {
        "source": "cosineSimilarity(params.query, 'vector') + 1.0",
```
This method (and the whole KnnVectorStore) could be a bit more generic by passing the vector field & the queries file as parameters, and reuse it directly in other tracks (e.g. so_vector).
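A parameterized version might look something like the following sketch (the function name and signature are illustrative, not the PR's actual code; the client is assumed to expose an Elasticsearch-style `search(index=..., body=...)` method):

```python
# Hypothetical sketch of a reusable exact-neighbor lookup, parameterized by the
# vector field and query vector so other tracks (e.g. so_vector) could share it.
def exact_neighbor_ids(client, index, vector_field, query_vector, size):
    body = {
        "size": size,
        "_source": False,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 shifts cosineSimilarity from [-1, 1] into [0, 2],
                    # since Elasticsearch scores must be non-negative
                    "source": f"cosineSimilarity(params.query, '{vector_field}') + 1.0",
                    "params": {"query": query_vector},
                },
            }
        },
    }
    response = client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]
```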
```python
return [hit["_id"] for hit in script_query["hits"]["hits"]]
```

```python
class KnnVectorStore:
```
Once this is reviewed, we can enable it for the so_vector track as well.
dense_vector/track.py (Outdated)
```python
    return []

def __repr__(self, *args, **kwargs):
    return "knn-recall-processor"
```
I think we can assume that `max_k=1000` or even `max_k=10000` and simplify the code (get rid of `KnnRecallProcessor`). What do you think of this?
`script_score` query performance doesn't change much depending on size, as we need to go through all vectors regardless.
Yeah I see your point! +1 for `max_k=1_000` (`10_000` might be overkill for 99% of the cases and we would just end up storing potentially thousands of these lists in memory). We can default to `1_000`, and if an operation with more results comes in, we can over-request for that specific task.
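The agreed behaviour could be sketched roughly like this (the helper and its signature are hypothetical, just to show the over-request fallback):

```python
# Illustrative sketch: assume a fixed default max_k, and only over-request
# when a specific operation asks for more neighbors than we have cached.
DEFAULT_MAX_K = 1_000

def neighbors_for(cached: dict, query_id: str, k: int, fetch_fn):
    """Return the top-k true neighbors, refetching only if k exceeds the cache."""
    ids = cached.get(query_id)
    if ids is None or len(ids) < k:
        # Cache miss, or an operation asked for more than we stored:
        # over-request for this query and remember the larger list.
        ids = fetch_fn(query_id, max(k, DEFAULT_MAX_K))
        cached[query_id] = ids
    return ids[:k]
```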
mayya-sharipova left a comment
@pmpailis Great work, very nicely written!
Overall the code looks good to me, but I left a comment about simplification.
Some other comments:
- Did you check the recall with your changes? Are the numbers the same as on the `master` branch?
- Have you checked the performance of only the recall operations (with and without your changes)? (We can temporarily set `"include-in-reporting": true` just for tests.)
- Would be nice if somebody from @elastic/es-perf also reviewed this code.
Yeah, recall-wise there weren't any noticeable changes - but I'll double check and post the results here as well to be on the safe side.
Will update the report and post the track's output here as well :)
+1, will ask in the channel for someone to also take a look at this :)
@mayya-sharipova updated the code to remove the `KnnRecallProcessor`.
mayya-sharipova left a comment
@pmpailis Thanks, great work! Very nice speedups of the recall operation!
dense_vector/track.py (Outdated)
```python
min_recall = k

for query in params["queries"]:
    knn_vector_store: KnnVectorStore = KnnVectorStore.get_instance(queries_file, vector_field)
```
We kind of discussed this already via Slack, but I think you'd be better off moving this `KnnVectorStore` initialisation out of the `KnnRecallRunner` and into the `KnnRecallParamSource` constructor. That way we cache the file upfront, before we even start issuing requests, by passing the `KnnVectorStore` as part of the runner's `params` arg.
I know you mentioned that the latency of these requests isn't a big concern, so I'll leave it up to you to decide given this is your workload, but in general we try to avoid adding anything blocking/slow to a runner besides the actual operation we're trying to measure.
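As a rough sketch of that suggestion (class and parameter names are illustrative, not Rally's actual API surface; `load_vector_store` is a stand-in for the real work):

```python
# Illustrative only: do the slow setup in the param source constructor and hand
# the already-built store to the runner through params, keeping the runner's
# critical path free of file I/O.
def load_vector_store(queries_file, vector_field):
    # placeholder for reading the queries file and preparing the store
    return {"queries_file": queries_file, "vector_field": vector_field}

class KnnRecallParamSourceSketch:
    def __init__(self, params):
        # Blocking I/O happens here, before any requests are issued.
        self._store = load_vector_store(
            params["queries-file"], params.get("vector-field", "vector")
        )
        self._params = params

    def params(self):
        # The runner receives a ready-made store; nothing slow remains here.
        return {**self._params, "knn-vector-store": self._store}
```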
Tbh I kept it in `KnnRecallRunner` because, in addition to being interested only in recall, IIUC (based on logs) initializing `KnnRecallParamSource` happens a number of times throughout the track's execution (e.g. for each recall operation), so I tried avoiding the additional I/O. We do however also get the plus of caching the knn vector store, as this happens only on two separate actors. E.g. logs from adding the `KnnVectorStore` to the constructor of `KnnRecallParamSource` with the lru cache:
```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 15:56:52,129 ActorAddr-(T|:59518)/PID:70822 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 15:57:00,732 ActorAddr-(T|:59571)/PID:70835 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```
and without the lru cache:
```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 16:39:10,13 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,52 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,89 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,126 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,163 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:18,794 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:41,863 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:06,430 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:27,997 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:51,577 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```
Would it make sense to add the params in the `KnnRecallParamSource` constructor and then read the file "lazily" when `params` is actually called, passing it to the runner? This would lead to the file being read only once instead of once per operation. I've made some changes that address this, so whenever you have some time could you please give it another 👀? 🙏
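The lazy variant being proposed might look roughly like this (a simplified sketch, assuming a JSON-lines queries file; not the track's actual class):

```python
import json

# Simplified sketch: the constructor only records the path, and the queries
# file is read on the first params() call, then reused on every later call.
class LazyKnnRecallParamSource:
    def __init__(self, queries_file):
        self._queries_file = queries_file
        self._queries = None  # nothing read yet

    def params(self):
        if self._queries is None:
            # First invocation only: one read for the task, not one per call.
            with open(self._queries_file) as f:
                self._queries = [json.loads(line) for line in f if line.strip()]
        return {"queries": self._queries}
```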
Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):
https://github.com/elastic/rally/blob/c1f04a368f5efaa28128dc5341b7082a62a13a34/esrally/driver/driver.py#L1792-L1812
In any case this is your workload/track, so if this implementation makes sense then I have no problems with it, just wanted to point out the critical path bit.
Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):
IIUC doesn't this codepath execute independently for each operation that the actor is assigned to, as we initialize a new AsyncIoAdapter for each task allocation - and the params_per_task are independent for each run (which makes sense since we have different params per operation)? 🤔 Maybe I'm missing something but - do we already account somehow for having commonly shared resources for similar operations?
In addition to that, we also seem to generate a KnnRecallParamSource when preparing the track and checking for used corpora which also introduces (for the specific case) unnecessary I/O load.
(Note: I think that the same holds for KnnParamSource as well)
IIUC doesn't this codepath execute independently for each operation that the actor is assigned to, as we initialize a new AsyncIoAdapter for each task allocation
Yes, sorry, I glossed over that part, I should have said "once per worker, per task".
and the params_per_task are independent for each run (which makes sense since we have different params per operation)? 🤔 Maybe I'm missing something but - do we already account somehow for having commonly shared resources for similar operations?
The AsyncExecutor is the part that actually executes the runner, and it gets its per-request parameters from the respective ScheduleHandle.
I took a deeper look at the code but I lack almost all of the required domain knowledge here to make any good advice. I'll leave it up to you how you want to handle it, i.e. you don't need my approval to merge 😆 as these are just general comments.
…rce and passing it as arg
Why aren't we statically initializing a list of nearest neighbors? We already know the dataset and the queries.
To have a static list of vectors we would have to either hash them to a unique value or use a custom set ID during indexing, so that we can keep the references, right? There are a couple of reasons I opted for a dynamically computed list.
It's not something of immediate "necessity", but it will make it a bit easier for users to add & adjust to their needs. Maybe we could also add an option to the recall operations to specify whether the true neighbors are to be computed on the fly or read from an external resource, so that we have both the flexibility if needed and address the performance cost for huge datasets. WDYT?
I think this is ok @pmpailis, but it does seem too complicated. But as long as it works :)
I like the current approach of @pmpailis.
@benwtrent if you're also ok with this, we could merge this and if needed we can track the work for offline true neighbors in a separate PR/issue.
I am fine with the PR as long as @b-deam approves.
It is so strange how the automation misbehaved.
The root cause of #518 (comment) was addressed, and affected PRs re-opened. |
In this PR we add support for a `KnnVectorStore` to be shared across all `knn-recall-*` operations in the same thespian actor. As the true exact neighbors of all `knn-recall` operations are the same, it would be beneficial to compute them only once and then re-use them for later runs, instead of re-computing them every time while the index itself remains unchanged.

So the idea here is to add a `KnnRecallProcessor` to identify the max `k` for the said track from all defined operations, so that we can later use this `k` to fetch the needed true exact neighbors and cover all cases. Then, each runner would consult this vector store to fetch the true `k` (where `k <= max_k`) nearest neighbors.

A couple of notes:
The `KnnVectorStore` is initialized by the first runner on each actor (currently we only have 1 client for the `dense_vector` track, so effectively only 1 actor). We could move this if needed to the `KnnRecallParamSource` instead of the `knn-recall-*` runners. However, as (i) this was already what was happening and (ii) we do not care about performance metrics in recall tasks, I believe that it's safe to define it through that. This could also be beneficial in cases where we might apply sampling and/or randomization on a per-operation level, as we wouldn't have to unnecessarily compute true nearest neighbors for every vector in the file.
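The per-actor sharing described above could be implemented, for example, with a module-level cache keyed by the store's inputs (a sketch, not the track's actual `get_instance`; the store contents are placeholders):

```python
import functools

# Sketch: every knn-recall-* runner in the same actor process gets the same
# store instance for a given (queries_file, vector_field) pair, so the
# expensive true-neighbor computation happens at most once per actor.
@functools.lru_cache(maxsize=None)
def get_vector_store(queries_file: str, vector_field: str) -> dict:
    # Stand-in for the real work: reading the queries file and computing
    # the exact nearest neighbors up to max_k.
    return {"queries_file": queries_file, "vector_field": vector_field, "neighbors": {}}
```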
Benchmarks on the `master` branch for the `dense_vector` track, from runs with & without the `KnnVectorStore`.

Benchmarks on a custom branch for the `dense_vector` track with 35 overall recall operations, ranging from `knn-recall-10-10` to `knn-recall-1000-4000`.

Recall & latency metric breakdown for all operations: