
Add ScaNN index backend with dtype-aware storage and tests#195

Merged
NohTow merged 8 commits into lightonai:main from robro612:pylate-xtr-pr
Mar 5, 2026
Conversation

@robro612
Contributor

@robro612 robro612 commented Feb 18, 2026

Summary

  • add a new ScaNN index backend and export it from pylate.indexes
  • support dtype-aware flattening/storage for fp16 and fp32, optional memory logging utilities, and scann optional dependencies in pyproject.toml
  • add focused ScaNN unit coverage for dtype handling, verbosity normalization, embedding retrieval-by-docid, and error paths
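The dtype-aware flattening described above can be sketched in plain NumPy. This is an illustrative sketch, not the PR's exact API: the function name and the `(matrix, ranges)` return shape are assumptions.

```python
import numpy as np


def flatten_embeddings(docs):
    """Flatten per-document token embeddings into one matrix.

    Preserves the input dtype (fp16 or fp32) and records the
    [start, end) row range each document occupies in the flat matrix.
    """
    dtype = docs[0].dtype  # assume all documents share one dtype
    ranges, start = [], 0
    for d in docs:
        ranges.append((start, start + len(d)))
        start += len(d)
    flat = np.concatenate(docs).astype(dtype, copy=False)
    return flat, ranges


docs = [
    np.ones((3, 4), dtype=np.float16),
    np.zeros((2, 4), dtype=np.float16),
]
flat, ranges = flatten_embeddings(docs)
# flat.shape == (5, 4), flat.dtype == float16, ranges == [(0, 3), (3, 5)]
```

Keeping the per-document ranges alongside the flat matrix is what makes retrieval-by-docid cheap later: a document's token embeddings are just a slice of the flat array.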

Test plan

  • python -m compileall pylate/indexes/scann.py tests/test_scann.py
  • python -m pytest tests/test_scann.py -q
  • Test end-to-end with python examples/evaluation/beir_dataset.py --index_type scann --dataset_name nfcorpus

This adds ScaNN index support with fp16/fp32-preserving flattening, shared index utilities, scann extras, and focused tests for verbosity, dtype handling, and embedding retrieval behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
@robro612 robro612 mentioned this pull request Feb 18, 2026
…s test

- Run ruff format/lint on scann.py and test_scann.py
- Replace per-token Python loop in __call__() result construction with
  np.split for vectorized reshaping
- Add test for numpy document embedding input (float16/float32)
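The vectorized reshaping mentioned above can be illustrated like this (a sketch of the technique, not the PR's code; the variable names are mine). Given a flat per-token result array and the token count of each query, `np.split` replaces a per-token Python loop:

```python
import numpy as np

# Flat results for all query tokens, e.g. from one batched search:
# 6 tokens total, 2 values per token.
flat_scores = np.arange(12, dtype=np.float32).reshape(6, 2)
tokens_per_query = [4, 2]  # two queries with 4 and 2 tokens

# np.split takes cumulative offsets; the trailing offset is implicit.
offsets = np.cumsum(tokens_per_query)[:-1]
per_query = np.split(flat_scores, offsets)
# per_query[0].shape == (4, 2); per_query[1].shape == (2, 2)
```

`np.split` returns views into the flat array, so the regrouping costs O(#queries) instead of O(#tokens) Python-level work.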

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raphaelsty
Collaborator

Hi @robro612, this is a cool PR, LGTM. Would it be possible to have a small benchmark on the smallest BEIR datasets, like scifact and a few other small ones, to confirm that the nDCG@10 is on par? As well as the QPS?

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds a new ScaNN (Scalable Nearest Neighbors) index backend to pylate, providing an alternative approximate nearest neighbor search option for ColBERT retrieval. The implementation includes dtype-aware storage supporting both fp16 and fp32 embeddings, optional memory logging utilities, and comprehensive test coverage.

Changes:

  • Adds ScaNN index implementation with auto-tuning parameters and optional autopilot mode
  • Introduces shared utility functions (reshape_embeddings, log_memory) in pylate/indexes/utils.py
  • Adds ScaNN optional dependencies with psutil for memory tracking
  • Includes comprehensive unit tests covering dtype handling, verbosity configuration, and error paths
  • Updates evaluation example script to support ScaNN index type

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pylate/indexes/scann.py | Complete ScaNN index implementation with build, save, load, query, and embedding retrieval functionality |
| pylate/indexes/utils.py | Shared utility functions for embedding reshaping and memory logging |
| pylate/indexes/__init__.py | Exports the ScaNN class from the indexes module |
| pyproject.toml | Adds scann and psutil to the optional dependencies |
| tests/test_scann.py | Comprehensive test suite for ScaNN functionality, including dtype handling and error cases |
| examples/evaluation/beir_dataset.py | Adds ScaNN as an index type option with fp16 model support |



def __init__(
self,
name: str | None = "ScaNN_index",
Copilot AI Feb 26, 2026

Parameter naming inconsistency: The ScaNN class uses name as a parameter, but all other index classes in this codebase (PLAID, Voyager, FastPlaid, StanfordPlaid) use index_name. This breaks API consistency. The parameter should be renamed to index_name to match the established pattern.
override: bool = False,
verbose_level: str | None = None,
) -> None:
self.name = name
Copilot AI Feb 26, 2026

The name attribute is stored (line 104, self.name = name) but only used in _get_index_path() (line 260). However, the docstring on lines 43-44 states it is "The name of the index collection", suggesting it should be consistently used and documented.
Comment on lines +555 to +558
emb_np = emb.to(
"cpu",
dtype=torch.float16 if np_dtype == np.float16 else torch.float32,
).numpy()
Copilot AI Feb 26, 2026

The to() conversion specifies a dtype that will always match the input tensor's dtype due to the check on lines 562-566, which ensures all embeddings already share np_dtype. Consider simplifying to emb.to("cpu").numpy(), or document why the explicit dtype conversion is needed here.

Suggested change
emb_np = emb.to(
"cpu",
dtype=torch.float16 if np_dtype == np.float16 else torch.float32,
).numpy()
emb_np = emb.to("cpu").numpy()
@robro612
Contributor Author

@raphaelsty Certainly, I'll have those numbers as a consequence of testing the training PRs soon. Keep in mind ScaNN is CPU-only, so it's rather slow compared to FastPLAID, but it was easier for me to use than Voyager, even with the huge memory overhead of storing the flattened embeddings. I'll also take a look at the fixes the bot suggested above.

robro612 added 2 commits March 2, 2026 13:40
Make batch_size a no-op default arg in ScaNN (ScaNN requires all docs at once; the argument is kept to maintain parity with other index classes)
Type/Docstring annotation fixes in retriever class
@robro612
Contributor Author

robro612 commented Mar 2, 2026

Update: Copilot fixes + BEIR benchmarks

Changes in this push

  • Renamed name param to index_name to match convention used by PLAID/Voyager/FastPlaid
  • Simplified redundant .to(dtype=...) in add_documents
  • Added default value for batch_size in add_documents (no-op for ScaNN but required by Base interface)
  • Fixed retrieve.ColBERT to properly route ScaNN through the candidate + rerank path (was incorrectly using the PLAID direct-return path)
  • Updated retriever type annotations (index: Base instead of Voyager | PLAID)
  • Added benchmark script (examples/evaluation/benchmark_beir.py)
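The batch_size parity change above might look roughly like this. This is a hypothetical sketch of the interface shape, not pylate's actual class: the argument is accepted for compatibility with the other index classes but ignored, because ScaNN needs every document at build time.

```python
import numpy as np


class ScaNNSketch:
    """Illustrative stand-in for the index interface, not pylate's code."""

    def __init__(self):
        self._pending = []  # (doc_id, embeddings) pairs buffered until build

    def add_documents(self, documents_ids, documents_embeddings, batch_size=2000):
        # batch_size is a no-op: ScaNN builds its searcher from all
        # embeddings at once, but the argument keeps signature parity
        # with PLAID/Voyager-style indexes that ingest in batches.
        self._pending.extend(zip(documents_ids, documents_embeddings))
        return self


index = ScaNNSketch()
index.add_documents(["d0"], [np.zeros((2, 4), dtype=np.float16)], batch_size=1)
# len(index._pending) == 1
```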

Benchmark results

Model: lightonai/GTE-ModernColBERT-v1 (fp16), A100 GPU, k=10

nDCG@10

| Dataset | #Docs | #Queries | PLAID | ScaNN |
| --- | ---: | ---: | ---: | ---: |
| nfcorpus | 3,633 | 323 | 0.3797 | 0.3796 |
| fiqa | 57,638 | 648 | 0.4520 | 0.4508 |
| scifact | 5,183 | 300 | 0.7622 | 0.7618 |
| trec-covid | 171,332 | 50 | 0.8391 | 0.8459 |

QPS (queries per second)

| Dataset | PLAID | ScaNN |
| --- | ---: | ---: |
| nfcorpus | 23.0 | 31.3 |
| fiqa | 21.4 | 6.7 |
| scifact | 21.2 | 7.1 |
| trec-covid | 16.2 | 3.4 |

Notes

  • nDCG@10 is virtually identical between PLAID and ScaNN across all datasets (within 0.01)
  • QPS: PLAID is faster on larger datasets due to GPU-accelerated retrieval; ScaNN (CPU-only) is faster on small datasets where the GPU overhead dominates
  • ScaNN indexing is slower than PLAID on A100 (CPU-bound kmeans), but provides a simpler dependency story (pip-installable, no Rust/ColBERT compilation)
  • ScaNN is a good option for users who want an easy-to-install approximate index without the PLAID toolchain

@raphaelsty
Collaborator

LGTM, the MR is very clean @robro612, thank you for the evaluation results! I'll run the CI and then merge :)

Collaborator

@NohTow NohTow left a comment

Hey!
Thanks for the amazing work and sorry for the delay in the review, busy days!

I've added a few comments; most of them are nits, but I figured it could help make things a bit cleaner (also some probably stupid questions, but I'd rather ask and find out I'm dumb than merge errors because I did not ask!)

Besides those, I think my main "comment" is that I wonder whether we should merge the part about time/memory profiling. It's very nice of you to have added all of that and gone the extra mile for benchmarking things, but I wonder if it's something we expect in the merged indexes.
On a related note, I wonder if we should merge examples/evaluation/benchmark_index_beir.py and, if so, I do not think it should be in this folder imho

document_length=300,
query_length=query_len.get(dataset_name),
)
).to(torch.float16)
Collaborator

Probably a nit because fp16 is almost equal to fp32, but I wonder if this should be an option.
It should be noted that I am using fp16 for most of my benches these days to save some memory on large datasets (I need to fix bf16 models that output fp32 because of numpy and thus have to be recast)


def _build_searcher(self, embeddings: np.ndarray) -> None:
"""Build the ScaNN searcher from embeddings (in-memory only)."""
build_start = time.time()
Collaborator

Do we still need these, though?
They're there to run the bench, right? I wonder if we should leave bench params in the final PR

)

# Build ScaNN searcher
log_memory("Before scann.build()", self.verbose)
Collaborator

A bit of a broad comment around the whole PR, but I wonder if we should leave time/memory profiling in the final merged thing

logger.warning(
f"[ScaNN] WARNING: Manual parameters provided but will be ignored: num_leaves={self.num_leaves}, num_leaves_to_search={self.num_leaves_to_search}, training_sample_size={self.training_sample_size}"
)
else:
Collaborator

Sorry if this is a dumb question, but from my understanding we were to use autopilot when params are not set.
It seems like you are defining some defaults here but not setting self.autopilot to True.
Were you referring to "auto tune" as in using defaults, or am I missing something?

Contributor Author

autopilot is a config setting directly in the ScaNN library. AFAICT it sets reasonable settings, but I don't know exactly how. Recalling my experiments, it's slower/more accurate than the defaults I set, which come directly from the XTR inference notebook.

)

metadata_path = index_path / "metadata.json"
doc_id_mapping_path = index_path / "doc_id_to_embedding_range.tsv"
Collaborator

Am I a pain for asking whether we could use the same type of processing as for the other indexes?
We went from sqlitedict to pickle, which should be pretty easy to implement (tsv -> pickled dict)
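The suggested tsv -> pickled dict switch could look roughly like this (a sketch; the file name mirrors the one above, and the doc ids and ranges are made-up examples):

```python
import pickle
import tempfile
from pathlib import Path

# Map doc_id -> (start, end) row range in the flattened embedding matrix.
doc_id_to_embedding_range = {"doc_0": (0, 3), "doc_1": (3, 5)}

index_path = Path(tempfile.mkdtemp())
mapping_path = index_path / "doc_id_to_embedding_range.pkl"

# Save: one pickle dump replaces writing/parsing TSV rows.
with mapping_path.open("wb") as f:
    pickle.dump(doc_id_to_embedding_range, f)

# Load: tuple values come back as tuples, no string-splitting needed.
with mapping_path.open("rb") as f:
    loaded = pickle.load(f)
# loaded == {"doc_0": (0, 3), "doc_1": (3, 5)}
```

Besides matching the other indexes, pickling avoids the int-parsing and escaping concerns of a hand-rolled TSV format.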

def __call__(
self,
queries_embeddings: list[list[int | float]],
k: int = 5,
Collaborator

Biggest nit of my life, but the default k for the other indexes is 10

If subset is provided (not yet implemented).

"""
if subset is not None:
Collaborator

I wonder if we should expose the param at all then; it's not exposed for Stanford PLAID, only for FastPlaid

pyproject.toml Outdated
"pytest-xdist >=3.6.0",
"pytest-rerunfailures >= 15.0.0",
"pytest >= 8.2.1",
"psutil >= 7.2.2",
Collaborator

Why do we need it in the main deps?

- Remove profiling/timing code (time.time, log_memory) from ScaNN index
- Remove psutil dependency from scann optional and dev deps
- Remove benchmark script from PR
- Switch doc_id_to_embedding_range storage from TSV to pickle
- Move _to_np_dtype helper to utils module as np_dtype_for
- Change default k from 5 to 10 to match other indexes
- Remove subset param from ScaNN.__call__ (not implemented)
- Add comment explaining NaN distances from ScaNN
- Add save/load round-trip test for pickle serialization
@robro612
Contributor Author

robro612 commented Mar 3, 2026

Addressing review comments

Thanks for the thorough review @NohTow! Here's what was addressed:

| Comment | Resolution |
| --- | --- |
| Remove profiling/timing code | Removed all time.time() profiling, log_memory() calls, and the _log_retrieve() method |
| Remove psutil dependency | Removed from both the scann optional deps and the dev deps |
| Remove benchmark script from PR | Removed (easy to recreate from the BEIR boilerplate if needed) |
| Use pickle instead of TSV for doc mapping | Switched doc_id_to_embedding_range serialization to pickle |
| Get metadata from the ScaNN object directly | Investigated: searcher.config() has the build-params protobuf but not our app-level flags (use_autopilot, store_embeddings), so the JSON sidecar is still needed, and desirable IMO for clear visibility without a load |
| Autopilot clarification | use_autopilot is a ScaNN config flag; when True it calls .autopilot() and ignores manual params. Handled correctly |
| Move nested function to utils | Extracted _to_np_dtype to np_dtype_for() in pylate/indexes/utils.py |
| remove_document filtering | Prefer not to patch behavior the underlying index doesn't offer |
| Default k=10 | Fixed to match the other indexes |
| subset param | Removed entirely (matches Voyager/StanfordPLAID, which also don't expose it) |
| Add NaN comment | Added an explanation that ScaNN can't always complete top-k |
| In-memory array for reconstructed embeddings | Kept as-is |

Also added a save/load round-trip test covering the pickle serialization.
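The NaN behavior noted above (ScaNN may return fewer than k neighbors, surfacing the missing slots as NaN distances) can be handled defensively as in this sketch. Pure NumPy; the -1 sentinel index is an assumption for illustration, not ScaNN's documented contract:

```python
import numpy as np

# Example search output where the index could not fill all k slots:
# missing entries show up as NaN distances (sentinel index assumed).
indices = np.array([7, 2, 9, -1], dtype=np.int64)
distances = np.array([0.91, 0.85, 0.40, np.nan], dtype=np.float32)

# Drop incomplete slots before passing candidates on to reranking.
valid = ~np.isnan(distances)
indices, distances = indices[valid], distances[valid]
# indices == [7, 2, 9]; only the three finite scores remain
```

Filtering before the rerank step keeps downstream scoring code free of NaN special-casing.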

Antoine Chaffin and others added 3 commits March 5, 2026 16:08
Better comment to not conflate our defaults with .autopilot() ScaNN feature

Co-authored-by: Antoine Chaffin <38869395+NohTow@users.noreply.github.com>
@NohTow NohTow merged commit d350714 into lightonai:main Mar 5, 2026
10 checks passed
@NohTow
Collaborator

NohTow commented Mar 5, 2026

Thanks for the PR!
