Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors by vigyasharma · Pull Request #541 · mikemccand/luceneutil

vigyasharma · 2026-02-26T21:37:44Z

Found a few issues in the script when using streaming mode with the v3 wiki vectors. This change fixes them and switches the script to streaming mode with the Cohere wiki v3 en vectors.

mikemccand

Thanks @vigyasharma -- but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling. The unshuffled Cohere v2 vectors were altering benchmark results: a horrible hidden corpus bias, because we didn't shuffle before. I should probably have removed this older tool, or at least added a comment explaining that. See https://github.com/mikemccand/luceneutil/blob/main/src/python/cohere-v3-README.txt

Let's still merge your improvements!

vigyasharma · 2026-02-27T19:49:37Z

> but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling

Whoa, I completely missed this, thanks!

I've added dataset shuffling to the tool. Note that shuffling breaks parent-join benchmarks as we need the dataset ordered to keep all paragraphs of an article (the children) within the same block. So I added a "--no-shuffle" flag that disables shuffling. And the script only generates the meta file when shuffling is disabled.
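A minimal sketch of how such a flag could be wired up with `argparse`. The function names `build_parser` and `plan` are hypothetical, and the PR's actual argument handling may differ; the only behavior taken from the comment above is that `--no-shuffle` disables shuffling and that the meta file is written only in that case.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the "--no-shuffle" flag name comes from the PR comment.
    parser = argparse.ArgumentParser(description="Export Cohere wiki vectors (sketch)")
    parser.add_argument(
        "--no-shuffle",
        action="store_true",
        help="keep dataset order so all paragraphs of an article stay in one block",
    )
    return parser


def plan(argv):
    """Translate command-line arguments into the two decisions described above."""
    args = build_parser().parse_args(argv)
    shuffle = not args.no_shuffle
    # The meta file is only produced when the original order is kept.
    return {"shuffle": shuffle, "write_meta": not shuffle}
```

Calling `plan(["--no-shuffle"])` yields a plan that keeps the original order and writes the meta file; calling `plan([])` shuffles and skips it.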

mikemccand · 2026-03-02T11:38:51Z

> > but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling
>
> Whoa, I completely missed this, thanks!
>
> I've added dataset shuffling to the tool. Note that shuffling breaks parent-join benchmarks as we need the dataset ordered to keep all paragraphs of an article (the children) within the same block. So I added a "--no-shuffle" flag that disables shuffling. And the script only generates the meta file when shuffling is disabled.

Wait -- the separate Cohere v3 upgrade tooling (see cohere-v3-README.txt) already shuffles, and does so carefully so that the parent/child relation is preserved.

It produces four outputs (query and documents X shuffle-wikids, shuffle-vectors). The -coalesced- variants preserve the parent/child relation. I can upload the -coalesced- versions to Cloudflare if you want, or if you run the v3 tooling and hit weird exceptions, let's debug?
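One way to shuffle while keeping each article's paragraphs together is to shuffle whole blocks rather than individual rows. The sketch below is an illustration only, not the Cohere v3 tooling itself; `shuffle_blocks` is a hypothetical helper and it assumes the input rows are already ordered so that all paragraphs of an article are adjacent.

```python
import random
from itertools import groupby


def shuffle_blocks(rows, key, seed=42):
    """Shuffle the order of parent blocks while keeping each block intact.

    rows: iterable of records, already grouped so equal keys are adjacent
    key:  function returning the parent (article) id of a record
    seed: fixed seed so the shuffle is reproducible
    """
    # groupby only merges *consecutive* rows with the same key,
    # which is why the input must arrive ordered by article.
    blocks = [list(group) for _, group in groupby(rows, key=key)]
    random.Random(seed).shuffle(blocks)
    # Flatten back to a single row sequence; child order inside a block is unchanged.
    return [row for block in blocks for row in block]
```

The article order is randomized, which removes ordering bias between articles, while the children of each parent remain contiguous and in their original relative order.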

Let's maybe delete this old tool and make it clear we are using the new one...

abernardi597 · 2026-03-04T18:19:16Z

src/python/infer_token_vectors_cohere.py

+
+    written = 0
+    for row in ds_iter:
+      q_emb = np.array(row["emb"], dtype=np.float32)


I believe this copies the row bytes to a numpy array.
We can avoid the extra copy by doing ds = ds.with_format("numpy") so that the initially allocated array is already a numpy array.
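The copy behavior in question can be illustrated with NumPy alone. This sketch assumes the row value is either a Python list (the default) or already a float32 ndarray (as the suggestion above intends); `as_float32` is a hypothetical helper and no dataset library is involved here.

```python
import numpy as np


def as_float32(emb):
    """Return emb as a float32 ndarray, copying only when a conversion is needed."""
    return np.asarray(emb, dtype=np.float32)


row_list = [0.1, 0.2, 0.3]                    # a row delivered as a Python list
row_np = np.array(row_list, dtype=np.float32)  # a row delivered as an ndarray

from_list = as_float32(row_list)  # must allocate: a list has no array buffer
from_np = as_float32(row_np)      # no copy: dtype already matches
copied = np.array(row_np, dtype=np.float32)  # np.array copies by default
```

When the row is already a float32 ndarray, `np.asarray` hands back the same buffer, whereas `np.array(...)` allocates a second one for every row.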

abernardi597 · 2026-03-04T18:22:41Z

src/python/infer_token_vectors_cohere.py

+        emb_dims = len(row["emb"])
+        print(f"embedding dimensions = {emb_dims}")
+        assert emb_dims == DIMENSIONS, f"Dataset embedding dimensions: {emb_dims} do not match configured dimensions: {DIMENSIONS}"
+      emb = np.array(row["emb"], dtype=np.float32)


knnPerfTest.py assumes little-endian, not native byte-order.

We should also verify the byte order of the incoming floats and be explicit about it, e.g. np.dtype("<f4") (little-endian float32) instead of np.float32 (native byte-order float32).

mikemccand · Mar 18, 2026

+1 to make it explicitly little-endian. It's at least KnnGraphTester.java that is making this assumption...

It "just works" because these days CPUs (Intel, AMD, ARM) are all little-endian? Actually I think ARM is "bi-endian" and can toggle the endianness on boot (wild).

@abernardi597 do the Cohere v3 tools do this correctly (write little-endian, not native-which-happens-to-be-little-endian)?

abernardi597 · Mar 18, 2026

It just asserts that each embedding is little-endian, but does not attempt to convert it if it isn't.
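A small sketch of the difference between checking and converting byte order with NumPy. The helper names `is_little_endian` and `to_le_f32_bytes` are hypothetical and are not taken from this PR or from the Cohere v3 tools; the sketch only shows that casting to `np.dtype("<f4")` before writing makes the on-disk layout independent of the host.

```python
import sys

import numpy as np

LE_F32 = np.dtype("<f4")  # explicitly little-endian float32


def is_little_endian(arr):
    """True if the array's in-memory byte order is little-endian."""
    order = arr.dtype.byteorder
    # "=" means native order, so it depends on the host; "|" means not applicable.
    return order == "<" or (order == "=" and sys.byteorder == "little")


def to_le_f32_bytes(values):
    """Serialize values as little-endian float32, converting if necessary."""
    arr = np.asarray(values)
    # copy=False avoids a copy when the array is already little-endian float32.
    return arr.astype(LE_F32, copy=False).tobytes()
```

An assert based on `is_little_endian` catches a mismatch, while `to_le_f32_bytes` also repairs it, so a reader that assumes little-endian floats gets the same bytes either way.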

github-actions · 2026-04-02T00:16:09Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

fix infer_toke_vectors for streaming and v3 cohere vectors

6ecd8d8

vigyasharma changed the title ~~Fix infer_toke_vectors for streaming and cohere wiki v3 vectors~~ Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors Feb 26, 2026

lint fail fix

e8883fe

mikemccand approved these changes Feb 27, 2026

vigyasharma added 4 commits February 27, 2026 11:28

add shuffle

c43e0fe

shuffle and add flag to disable shuffling

ef60820

add logs

8e03e99

lint fix

16806e9

update readme

4664925

abernardi597 suggested changes Mar 4, 2026

github-actions bot added the Stale label Apr 2, 2026

vigyasharma wants to merge 7 commits into mikemccand:main from vigyasharma:stream
