Adding KnnVectorStore to dense_vector track for supporting knn-recall operations #518
pmpailis merged 21 commits into elastic:master from pmpailis:feature/recall_score_cache
Conversation
dense_vector/track.py (Outdated)
```python
"script_score": {
    "query": {"match_all": {}},
    "script": {
        "source": "cosineSimilarity(params.query, 'vector') + 1.0",
```
This method (and the whole KnnVectorStore) could be a bit more generic by passing the vector field & the queries file as parameters, and reuse it directly in other tracks (e.g. so_vector).
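A parameterized version might look something like the following sketch (the function name and signature are illustrative, not the PR's actual code; the client is assumed to expose an Elasticsearch-style `search(index=..., body=...)` method):

```python
# Hypothetical sketch of a reusable exact-neighbor lookup, parameterized by the
# vector field and query vector so other tracks (e.g. so_vector) could share it.
def exact_neighbor_ids(client, index, vector_field, query_vector, size):
    body = {
        "size": size,
        "_source": False,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 shifts cosineSimilarity from [-1, 1] into [0, 2],
                    # since Elasticsearch scores must be non-negative
                    "source": f"cosineSimilarity(params.query, '{vector_field}') + 1.0",
                    "params": {"query": query_vector},
                },
            }
        },
    }
    response = client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]
```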
```python
return [hit["_id"] for hit in script_query["hits"]["hits"]]
```

```python
class KnnVectorStore:
```
Once this is reviewed, we can enable it for the so_vector track as well.
dense_vector/track.py (Outdated)
```python
    return []

def __repr__(self, *args, **kwargs):
    return "knn-recall-processor"
```
I think we can assume that `max_k=1000` or even `max_k=10000` and simplify the code (get rid of `KnnRecallProcessor`). What do you think of this?
`script_score` query performance doesn't change much depending on size, as we need to go through all vectors regardless.
Yeah I see your point! +1 for `max_k=1_000` (`10_000` might be overkill for 99% of the cases and we would just end up storing potentially thousands of these lists in memory). We can default to `1_000`, and if an operation with more results comes in, we can over-request for that specific task.
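The agreed behaviour could be sketched roughly like this (the helper and its signature are hypothetical, just to show the over-request fallback):

```python
# Illustrative sketch: assume a fixed default max_k, and only over-request
# when a specific operation asks for more neighbors than we have cached.
DEFAULT_MAX_K = 1_000

def neighbors_for(cached: dict, query_id: str, k: int, fetch_fn):
    """Return the top-k true neighbors, refetching only if k exceeds the cache."""
    ids = cached.get(query_id)
    if ids is None or len(ids) < k:
        # Cache miss, or an operation asked for more than we stored:
        # over-request for this query and remember the larger list.
        ids = fetch_fn(query_id, max(k, DEFAULT_MAX_K))
        cached[query_id] = ids
    return ids[:k]
```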
mayya-sharipova left a comment
@pmpailis Great work, very nicely written!
Overall the code looks good to me, but I left a comment about simplification.
Some other comments:
- Did you check the recall with your changes? Are the numbers the same as on the `master` branch?
- Have you checked the performance of only the recall operations (with and without your changes)? (We can temporarily set `"include-in-reporting": true` just for tests.)
- Would be nice if somebody from @elastic/es-perf also reviewed this code.
Yeah, recall-wise there weren't any noticeable changes - but I'll double check and post the results here as well to be on the safe side.
Will update the report and post the track's output here as well :)
+1, will ask in the channel for someone to also take a look at this :)
@mayya-sharipova updated the code to remove the `KnnRecallProcessor`.
mayya-sharipova left a comment
@pmpailis Thanks, great work! Very nice speedups of the recall operation!
dense_vector/track.py (Outdated)
```python
min_recall = k

for query in params["queries"]:
    knn_vector_store: KnnVectorStore = KnnVectorStore.get_instance(queries_file, vector_field)
```
We kind of discussed this already via Slack, but I think you'd be better off moving this `KnnVectorStore` initialisation out of the `KnnRecallRunner` and into the `KnnRecallParamSource` constructor. That way we cache the file upfront, before we even start issuing requests, by passing the `KnnVectorStore` as part of the runner's `params` arg.
I know you mentioned that the latency of these requests isn't a big concern, so I'll leave it up to you to decide given this is your workload, but in general we try to avoid adding anything blocking/slow to a runner besides the actual operation we're trying to measure.
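As a rough sketch of that suggestion (class and parameter names are illustrative, not Rally's actual API surface; `load_vector_store` is a stand-in for the real work):

```python
# Illustrative only: do the slow setup in the param source constructor and hand
# the already-built store to the runner through params, keeping the runner's
# critical path free of file I/O.
def load_vector_store(queries_file, vector_field):
    # placeholder for reading the queries file and preparing the store
    return {"queries_file": queries_file, "vector_field": vector_field}

class KnnRecallParamSourceSketch:
    def __init__(self, params):
        # Blocking I/O happens here, before any requests are issued.
        self._store = load_vector_store(
            params["queries-file"], params.get("vector-field", "vector")
        )
        self._params = params

    def params(self):
        # The runner receives a ready-made store; nothing slow remains here.
        return {**self._params, "knn-vector-store": self._store}
```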
Tbh I kept it in `KnnRecallRunner` because, in addition to being interested only in recall, IIUC (based on logs) initializing `KnnRecallParamSource` happens a number of times throughout the track's execution (e.g. for each recall operation), so I tried avoiding the additional I/O. We do however also get the plus of caching the knn vector store, as this happens only on two separate actors. E.g. logs from adding the `KnnVectorStore` to the constructor of `KnnRecallParamSource` with the lru cache:
```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 15:56:52,129 ActorAddr-(T|:59518)/PID:70822 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 15:57:00,732 ActorAddr-(T|:59571)/PID:70835 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```
and without the lru cache:
```
❯ tail -f /Users/panagiotis.bailis/.rally/logs/rally.log | grep "Initializing KnnVectorStore"
2023-11-24 16:39:10,13 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,52 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,89 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,126 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:10,163 ActorAddr-(T|:62100)/PID:75072 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:18,794 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:39:41,863 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:06,430 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:27,997 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
2023-11-24 16:40:51,577 ActorAddr-(T|:62150)/PID:75085 dense_vector.track INFO Initializing KnnVectorStore for queries file: '/Users/panagiotis.bailis/workspace/github/pmpailis/rally-tracks/dense_vector/queries.json' and vector field: 'vector'
```
Would it make sense to add the params in the `KnnRecallParamSource` constructor and then read the file "lazily" when `params` is actually called, passing it to the runner? This would lead to the file being read only once instead of once per operation. I've made some changes that address this, so whenever you have some time could you please give it another 👀? 🙏
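The lazy variant being proposed might look roughly like this (a simplified sketch, assuming a JSON-lines queries file; not the track's actual class):

```python
import json

# Simplified sketch: the constructor only records the path, and the queries
# file is read on the first params() call, then reused on every later call.
class LazyKnnRecallParamSource:
    def __init__(self, queries_file):
        self._queries_file = queries_file
        self._queries = None  # nothing read yet

    def params(self):
        if self._queries is None:
            # First invocation only: one read for the task, not one per call.
            with open(self._queries_file) as f:
                self._queries = [json.loads(line) for line in f if line.strip()]
        return {"queries": self._queries}
```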
Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):
https://github.com/elastic/rally/blob/c1f04a368f5efaa28128dc5341b7082a62a13a34/esrally/driver/driver.py#L1792-L1812
In any case this is your workload/track, so if this implementation makes sense then I have no problems with it, just wanted to point out the critical path bit.
Hmm, we shouldn't be constructing a KnnRecallParamSource more than once per worker (actor):
IIUC doesn't this codepath execute independently for each operation that the actor is assigned to, as we initialize a new AsyncIoAdapter for each task allocation - and the params_per_task are independent for each run (which makes sense since we have different params per operation)? 🤔 Maybe I'm missing something but - do we already account somehow for having commonly shared resources for similar operations?
In addition to that, we also seem to generate a KnnRecallParamSource when preparing the track and checking for used corpora which also introduces (for the specific case) unnecessary I/O load.
(Note: I think that the same holds for KnnParamSource as well)
IIUC doesn't this codepath execute independently for each operation that the actor is assigned to, as we initialize a new AsyncIoAdapter for each task allocation
Yes, sorry, I glossed over that part, I should have said "once per worker, per task".
and the params_per_task are independent for each run (which makes sense since we have different params per operation)? 🤔 Maybe I'm missing something but - do we already account somehow for having commonly shared resources for similar operations?
The AsyncExecutor is the part that actually executes the runner, and it gets its per-request parameters from the respective ScheduleHandle.
I took a deeper look at the code but I lack almost all of the required domain knowledge here to make any good advice. I'll leave it up to you how you want to handle it, i.e. you don't need my approval to merge 😆 as these are just general comments.
…rce and passing it as arg
Why aren't we statically initializing a list of nearest neighbors? We already know the dataset and the queries.
To have a static list of vectors we would have to either hash them to a unique value or use a custom set ID during indexing, so that we can keep the references, right? There are a couple of reasons I opted for a dynamically computed list.
It's not something of immediate "necessity", but it will make it a bit easier for users to add & adjust to their needs. Maybe we could also add an option to the recall operations to specify whether the true neighbors are to be computed on the fly or read from an external resource, so that we have both the flexibility if needed and address the performance cost for huge datasets. WDYT?
I think this is ok @pmpailis, but it does seem too complicated. But as long as it works :)
I like the current approach of @pmpailis.
@benwtrent if you're also ok with this, we could merge this and if needed we can track the work for offline true neighbors in a separate PR/issue.
I am fine with the PR as long as @b-deam approves.
It is so strange how the automation misbehaved.
The root cause of #518 (comment) was addressed, and affected PRs re-opened. |
In this PR we add support for a `KnnVectorStore` to be shared across all `knn-recall-*` operations in the same thespian actor. As the true exact neighbors of all `knn-recall` operations are the same, it would be beneficial to compute them only once and then re-use them for later runs, instead of re-computing them every time while the index itself remains unchanged.

So the idea here is to add a `KnnRecallProcessor` to identify the max `k` for the said track from all defined operations, so that we can later use this `k` to fetch the needed true exact neighbors and cover all cases. Then, each runner would consult this vector store to fetch the true `k` (where `k <= max_k`) nearest neighbors.

A couple of notes:
The `KnnVectorStore` is initialized by the first runner on each actor (currently we only have 1 client for the `dense_vector` track, so effectively only 1 actor). We could move this if needed to the `KnnRecallParamSource` instead of the `knn-recall-*` runners. However, as (i) this was already what was happening and (ii) we do not care about performance metrics in recall tasks, I believe that it's safe to define it through that. This could also be beneficial in cases where we might apply sampling and/or randomization on a per-operation level, as we wouldn't have to unnecessarily compute true nearest neighbors for every vector in the file.
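The per-actor sharing described above could be implemented, for example, with a module-level cache keyed by the store's inputs (a sketch, not the track's actual `get_instance`; the store contents are placeholders):

```python
import functools

# Sketch: every knn-recall-* runner in the same actor process gets the same
# store instance for a given (queries_file, vector_field) pair, so the
# expensive true-neighbor computation happens at most once per actor.
@functools.lru_cache(maxsize=None)
def get_vector_store(queries_file: str, vector_field: str) -> dict:
    # Stand-in for the real work: reading the queries file and computing
    # the exact nearest neighbors up to max_k.
    return {"queries_file": queries_file, "vector_field": vector_field, "neighbors": {}}
```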
Benchmarks on the `master` branch for the `dense_vector` track, from runs with & without the `KnnVectorStore`.

Benchmarks on a custom branch for the `dense_vector` track with 35 overall recall operations, ranging from `knn-recall-10-10` to `knn-recall-1000-4000`.

Recall & latency metric breakdown for all operations: