Skip to content

Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors#541

Open
vigyasharma wants to merge 7 commits intomikemccand:mainfrom
vigyasharma:stream
Open

Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors#541
vigyasharma wants to merge 7 commits intomikemccand:mainfrom
vigyasharma:stream

Conversation

@vigyasharma
Copy link
Copy Markdown
Collaborator

Found a few issues in the script with using streaming mode and v3 wiki vectors. This change adds fixes and changes the script to use streaming mode with Cohere wiki v3 en vectors.

@vigyasharma vigyasharma changed the title Fix infer_toke_vectors for streaming and cohere wiki v3 vectors Fix infer_token_vectors_cohere for streaming and cohere wiki v3 vectors Feb 26, 2026
Copy link
Copy Markdown
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vigyasharma -- but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling (the unshuffled Cohere v2 was altering benchmark results, horrible hidden corpus bias because we didn't shuffle before) -- I should have removed this older one maybe, or at least added a comment explaining that. See https://github.com/mikemccand/luceneutil/blob/main/src/python/cohere-v3-README.txt

Let's still merge your improvements!

@vigyasharma
Copy link
Copy Markdown
Collaborator Author

but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling

Whoa, I completely missed this, thanks!

I've added dataset shuffling to the tool. Note that shuffling breaks parent-join benchmarks as we need the dataset ordered to keep all paragraphs of an article (the children) within the same block. So I added a "--no-shuffle" flag that disables shuffling. And the script only generates the meta file when shuffling is disabled.

@mikemccand
Copy link
Copy Markdown
Owner

but note that we have a whole different (annoyingly) Python tool for Cohere v3 since we needed shuffling

Whoa, I completely missed this, thanks!

I've added dataset shuffling to the tool. Note that shuffling breaks parent-join benchmarks as we need the dataset ordered to keep all paragraphs of an article (the children) within the same block. So I added a "--no-shuffle" flag that disables shuffling. And the script only generates the meta file when shuffling is disabled.

Wait -- the separate Cohere v3 upgrade tooling (see cohere-v3-README.txt) already shuffles, and does so carefully so that the parent/child relation is preserved.

It produces four outputs (query and documents X shuffle-wikids, shuffle-vectors). The -coalesced- variants preserve the parent/child. I can upload the -coalesced- versions to Cloudflare if you want, or if you run the v3 tooling and hit weird exceptions, let's debug?

Let's maybe delete this old tool and make it clear we are using the new one...


written = 0
for row in ds_iter:
q_emb = np.array(row["emb"], dtype=np.float32)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this copies the row bytes to a numpy array.
We can avoid extra the copy by doing ds = ds.with_format("numpy") so that initially allocated array is already a numpy array.

emb_dims = len(row["emb"])
print(f"embedding dimensions = {emb_dims}")
assert emb_dims == DIMENSIONS, f"Dataset embedding dimensions: {emb_dims} do not match configured dimensions: {DIMENSIONS}"
emb = np.array(row["emb"], dtype=np.float32)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

knnPerfTest.py assumes little-endian, not native byte-order.

We should also verify the byte-order of the incoming floats.
e.g. np.dtype("<f4") (little-endian float32) instead of np.float32 (native byte-order float32).

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to make it explicitly little-endian. It's at least KnnGraphTester.java that is making this assumption...

It "just works" because these days CPUs (Intel, AMD, ARM) are all little endian? Actually I think ARM are "bi-endian" and can toggle the endian-ness on boot (wild).

@abernardi597 do the Cohere v3 tools do this correctly (write little endian, not native-which-happens-to-be-little-endian)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just asserts that each embedding is little endian here, but does not attempt to convert if it isn't

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants