Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors#541
Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors#541vigyasharma wants to merge 7 commits intomikemccand:mainfrom
Conversation
mikemccand
left a comment
There was a problem hiding this comment.
Thanks @vigyasharma -- but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling (the unshuffled Cohere v2 was altering benchmark results, horrible hidden corpus bias because we didn't shuffle before) -- I should have removed this older one maybe, or at least added a comment explaining that. See https://github.com/mikemccand/luceneutil/blob/main/src/python/cohere-v3-README.txt
Let's still merge your improvements!
Whoa, I completely missed this, thanks! I've added dataset shuffling to the tool. Note that shuffling breaks parent-join benchmarks as we need the dataset ordered to keep all paragraphs of an article (the children) within the same block. So I added a |
Wait -- the separate Cohere v3 upgrade tooling (see It produces four outputs (query and documents X shuffle-wikids, shuffle-vectors). The Let's maybe delete this old tool and make it clear we are using the new one... |
|
|
||
| written = 0 | ||
| for row in ds_iter: | ||
| q_emb = np.array(row["emb"], dtype=np.float32) |
There was a problem hiding this comment.
I believe this copies the row bytes to a numpy array.
We can avoid extra the copy by doing ds = ds.with_format("numpy") so that initially allocated array is already a numpy array.
| emb_dims = len(row["emb"]) | ||
| print(f"embedding dimensions = {emb_dims}") | ||
| assert emb_dims == DIMENSIONS, f"Dataset embedding dimensions: {emb_dims} do not match configured dimensions: {DIMENSIONS}" | ||
| emb = np.array(row["emb"], dtype=np.float32) |
There was a problem hiding this comment.
knnPerfTest.py assumes little-endian, not native byte-order.
We should also verify the byte-order of the incoming floats.
e.g. np.dtype("<f4") (little-endian float32) instead of np.float32 (native byte-order float32).
There was a problem hiding this comment.
+1 to make it explicitly little-endian. It's at least KnnGraphTester.java that is making this assumption...
It "just works" because these days CPUs (Intel, AMD, ARM) are all little endian? Actually I think ARM are "bi-endian" and can toggle the endian-ness on boot (wild).
@abernardi597 do the Cohere v3 tools do this correctly (write little endian, not native-which-happens-to-be-little-endian)?
There was a problem hiding this comment.
It just asserts that each embedding is little endian here, but does not attempt to convert if it isn't
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Found a few issues in the script with using streaming mode and v3 wiki vectors. This change adds fixes and changes the script to use streaming mode with Cohere wiki v3 en vectors.