
Add ScaNN index backend with dtype-aware storage and tests#195

Merged
NohTow merged 8 commits into lightonai:main from robro612:pylate-xtr-pr
Mar 5, 2026
Conversation

@robro612
Contributor

@robro612 robro612 commented Feb 18, 2026

Summary

  • add a new ScaNN index backend and export it from pylate.indexes
  • support dtype-aware flattening/storage for fp16 and fp32, optional memory logging utilities, and scann optional dependencies in pyproject.toml
  • add focused ScaNN unit coverage for dtype handling, verbosity normalization, embedding retrieval-by-docid, and error paths
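The dtype-aware flattening described above can be sketched in plain NumPy. This is an illustrative sketch, not the PR's exact API: the function name and the `(matrix, ranges)` return shape are assumptions.

```python
import numpy as np


def flatten_embeddings(docs):
    """Flatten per-document token embeddings into one matrix.

    Preserves the input dtype (fp16 or fp32) and records the
    [start, end) row range each document occupies in the flat matrix.
    """
    dtype = docs[0].dtype  # assume all documents share one dtype
    ranges, start = [], 0
    for d in docs:
        ranges.append((start, start + len(d)))
        start += len(d)
    flat = np.concatenate(docs).astype(dtype, copy=False)
    return flat, ranges


docs = [
    np.ones((3, 4), dtype=np.float16),
    np.zeros((2, 4), dtype=np.float16),
]
flat, ranges = flatten_embeddings(docs)
# flat.shape == (5, 4), flat.dtype == float16, ranges == [(0, 3), (3, 5)]
```

Keeping the per-document ranges alongside the flat matrix is what makes retrieval-by-docid cheap later: a document's token embeddings are just a slice of the flat array.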

Test plan

  • python -m compileall pylate/indexes/scann.py tests/test_scann.py
  • python -m pytest tests/test_scann.py -q
  • Test end-to-end with python examples/evaluation/beir_dataset.py --index_type scann --dataset_name nfcorpus

This adds ScaNN index support with fp16/fp32-preserving flattening, shared index utilities, scann extras, and focused tests for verbosity, dtype handling, and embedding retrieval behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
@robro612 robro612 mentioned this pull request Feb 18, 2026
…s test

- Run ruff format/lint on scann.py and test_scann.py
- Replace per-token Python loop in __call__() result construction with
  np.split for vectorized reshaping
- Add test for numpy document embedding input (float16/float32)
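The vectorized reshaping mentioned above can be illustrated like this (a sketch of the technique, not the PR's code; the variable names are mine). Given a flat per-token result array and the token count of each query, `np.split` replaces a per-token Python loop:

```python
import numpy as np

# Flat results for all query tokens, e.g. from one batched search:
# 6 tokens total, 2 values per token.
flat_scores = np.arange(12, dtype=np.float32).reshape(6, 2)
tokens_per_query = [4, 2]  # two queries with 4 and 2 tokens

# np.split takes cumulative offsets; the trailing offset is implicit.
offsets = np.cumsum(tokens_per_query)[:-1]
per_query = np.split(flat_scores, offsets)
# per_query[0].shape == (4, 2); per_query[1].shape == (2, 2)
```

`np.split` returns views into the flat array, so the regrouping costs O(#queries) instead of O(#tokens) Python-level work.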

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raphaelsty
Collaborator

Hi @robro612, this is a cool PR, LGTM. Would it be possible to have a small benchmark on the smallest BEIR datasets, like scifact and a few other small ones, to confirm that the nDCG@10 is on par? As well as the QPS?

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds a new ScaNN (Scalable Nearest Neighbors) index backend to pylate, providing an alternative approximate nearest neighbor search option for ColBERT retrieval. The implementation includes dtype-aware storage supporting both fp16 and fp32 embeddings, optional memory logging utilities, and comprehensive test coverage.

Changes:

  • Adds ScaNN index implementation with auto-tuning parameters and optional autopilot mode
  • Introduces shared utility functions (reshape_embeddings, log_memory) in pylate/indexes/utils.py
  • Adds ScaNN optional dependencies with psutil for memory tracking
  • Includes comprehensive unit tests covering dtype handling, verbosity configuration, and error paths
  • Updates evaluation example script to support ScaNN index type

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pylate/indexes/scann.py | Complete ScaNN index implementation with build, save, load, query, and embedding retrieval functionality |
| pylate/indexes/utils.py | Shared utility functions for embedding reshaping and memory logging |
| pylate/indexes/__init__.py | Exports the ScaNN class from the indexes module |
| pyproject.toml | Adds scann and psutil to the optional dependencies |
| tests/test_scann.py | Comprehensive test suite for ScaNN functionality, including dtype handling and error cases |
| examples/evaluation/beir_dataset.py | Adds ScaNN as an index type option with fp16 model support |



def __init__(
self,
name: str | None = "ScaNN_index",
Copilot AI Feb 26, 2026

Parameter naming inconsistency: The ScaNN class uses name as a parameter, but all other index classes in this codebase (PLAID, Voyager, FastPlaid, StanfordPlaid) use index_name. This breaks API consistency. The parameter should be renamed to index_name to match the established pattern.
override: bool = False,
verbose_level: str | None = None,
) -> None:
self.name = name
Copilot AI Feb 26, 2026

The name attribute is stored (line 104, self.name = name) but only used in _get_index_path() (line 260). However, the docstring on lines 43-44 states it is "The name of the index collection", suggesting it should be consistently used and documented.
Comment on lines +555 to +558
emb_np = emb.to(
"cpu",
dtype=torch.float16 if np_dtype == np.float16 else torch.float32,
).numpy()
Copilot AI Feb 26, 2026

The to() conversion specifies a dtype that will always match the input tensor's dtype due to the check on lines 562-566, which ensures all embeddings already share np_dtype. Consider simplifying to emb.to("cpu").numpy(), or document why the explicit dtype conversion is needed here.

Suggested change
emb_np = emb.to(
"cpu",
dtype=torch.float16 if np_dtype == np.float16 else torch.float32,
).numpy()
emb_np = emb.to("cpu").numpy()
@robro612
Contributor Author

@raphaelsty Certainly, I'll have those numbers as a consequence of testing the training PRs soon. Keep in mind ScaNN is CPU-only, so it's rather slow compared to FastPLAID, but it was easier for me to use than Voyager, even with the huge memory overhead of storing the flattened embeddings. I'll also take a look at the fixes the bot suggested above.

robro612 added 2 commits March 2, 2026 13:40
Make batch_size a no-op default arg in ScaNN (ScaNN requires all docs at once; the argument is kept to maintain parity with other index classes)
Type/Docstring annotation fixes in retriever class
@robro612
Contributor Author

robro612 commented Mar 2, 2026

Update: Copilot fixes + BEIR benchmarks

Changes in this push

  • Renamed name param to index_name to match convention used by PLAID/Voyager/FastPlaid
  • Simplified redundant .to(dtype=...) in add_documents
  • Added default value for batch_size in add_documents (no-op for ScaNN but required by Base interface)
  • Fixed retrieve.ColBERT to properly route ScaNN through the candidate + rerank path (was incorrectly using the PLAID direct-return path)
  • Updated retriever type annotations (index: Base instead of Voyager | PLAID)
  • Added benchmark script (examples/evaluation/benchmark_beir.py)
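The batch_size parity change above might look roughly like this. This is a hypothetical sketch of the interface shape, not pylate's actual class: the argument is accepted for compatibility with the other index classes but ignored, because ScaNN needs every document at build time.

```python
import numpy as np


class ScaNNSketch:
    """Illustrative stand-in for the index interface, not pylate's code."""

    def __init__(self):
        self._pending = []  # (doc_id, embeddings) pairs buffered until build

    def add_documents(self, documents_ids, documents_embeddings, batch_size=2000):
        # batch_size is a no-op: ScaNN builds its searcher from all
        # embeddings at once, but the argument keeps signature parity
        # with PLAID/Voyager-style indexes that ingest in batches.
        self._pending.extend(zip(documents_ids, documents_embeddings))
        return self


index = ScaNNSketch()
index.add_documents(["d0"], [np.zeros((2, 4), dtype=np.float16)], batch_size=1)
# len(index._pending) == 1
```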

Benchmark results

Model: lightonai/GTE-ModernColBERT-v1 (fp16), A100 GPU, k=10

nDCG@10

| Dataset | #Docs | #Queries | PLAID | ScaNN |
| --- | ---: | ---: | ---: | ---: |
| nfcorpus | 3,633 | 323 | 0.3797 | 0.3796 |
| fiqa | 57,638 | 648 | 0.4520 | 0.4508 |
| scifact | 5,183 | 300 | 0.7622 | 0.7618 |
| trec-covid | 171,332 | 50 | 0.8391 | 0.8459 |

QPS (queries per second)

| Dataset | PLAID | ScaNN |
| --- | ---: | ---: |
| nfcorpus | 23.0 | 31.3 |
| fiqa | 21.4 | 6.7 |
| scifact | 21.2 | 7.1 |
| trec-covid | 16.2 | 3.4 |

Notes

  • nDCG@10 is virtually identical between PLAID and ScaNN across all datasets (within 0.01)
  • QPS: PLAID is faster on larger datasets due to GPU-accelerated retrieval; ScaNN (CPU-only) is faster on small datasets where the GPU overhead dominates
  • ScaNN indexing is slower than PLAID on A100 (CPU-bound kmeans), but provides a simpler dependency story (pip-installable, no Rust/ColBERT compilation)
  • ScaNN is a good option for users who want an easy-to-install approximate index without the PLAID toolchain

@raphaelsty
Collaborator

LGTM, the MR is very clean @robro612, thank you for the evaluation results! I'll run the CI and then merge :)

Collaborator

@NohTow NohTow left a comment

Hey!
Thanks for the amazing work and sorry for the delay in the review, busy days!

I've added a few comments; most of them are nits, but I figured it could help make things a bit cleaner (also some probably stupid questions, but I'd rather ask and find out I'm dumb than merge errors because I did not ask!)

Besides those, I think my main "comment" is that I wonder whether we should merge the part about time/memory profiling. It's very nice of you to have added all of that and gone the extra mile for benchmarking things, but I wonder if it's something we expect in the merged indexes.
On a related note, I wonder if we should merge examples/evaluation/benchmark_index_beir.py and, if so, I do not think it should be in this folder imho

document_length=300,
query_length=query_len.get(dataset_name),
)
).to(torch.float16)
Collaborator

Probably a nit because fp16 is almost equal to fp32, but I wonder if this should be an option.
It should be noted that I am using fp16 for most of my benches these days to save some memory on large datasets (I need to fix bf16 models that output fp32 because of numpy and thus have to be recast)


def _build_searcher(self, embeddings: np.ndarray) -> None:
"""Build the ScaNN searcher from embeddings (in-memory only)."""
build_start = time.time()
Collaborator

Do we still need these, though?
They're there to run the bench, right? I wonder if we should leave bench params in the final PR

)

# Build ScaNN searcher
log_memory("Before scann.build()", self.verbose)
Collaborator

A bit of a broad comment around the whole PR, but I wonder if we should leave time/memory profiling in the final merged thing

logger.warning(
f"[ScaNN] WARNING: Manual parameters provided but will be ignored: num_leaves={self.num_leaves}, num_leaves_to_search={self.num_leaves_to_search}, training_sample_size={self.training_sample_size}"
)
else:
Collaborator

Sorry if this is a dumb question, but from my understanding we were to use autopilot when params are not set.
It seems like you are defining some defaults here but not setting self.autopilot to True.
Were you referring to "auto tune" as in using defaults, or am I missing something?

Contributor Author

autopilot is a config setting directly in the ScaNN library. AFAICT it sets reasonable settings, but I don't know exactly how. Recalling my experiments, it's slower/more accurate than the defaults I set, which come directly from the XTR inference notebook.

)

metadata_path = index_path / "metadata.json"
doc_id_mapping_path = index_path / "doc_id_to_embedding_range.tsv"
Collaborator

Am I a pain for asking whether we could use the same type of processing as for the other indexes?
We went from sqlitedict to pickle, which should be pretty easy to implement (tsv -> pickled dict)
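The suggested tsv -> pickled dict switch could look roughly like this (a sketch; the file name mirrors the one above, and the doc ids and ranges are made-up examples):

```python
import pickle
import tempfile
from pathlib import Path

# Map doc_id -> (start, end) row range in the flattened embedding matrix.
doc_id_to_embedding_range = {"doc_0": (0, 3), "doc_1": (3, 5)}

index_path = Path(tempfile.mkdtemp())
mapping_path = index_path / "doc_id_to_embedding_range.pkl"

# Save: one pickle dump replaces writing/parsing TSV rows.
with mapping_path.open("wb") as f:
    pickle.dump(doc_id_to_embedding_range, f)

# Load: tuple values come back as tuples, no string-splitting needed.
with mapping_path.open("rb") as f:
    loaded = pickle.load(f)
# loaded == {"doc_0": (0, 3), "doc_1": (3, 5)}
```

Besides matching the other indexes, pickling avoids the int-parsing and escaping concerns of a hand-rolled TSV format.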

def __call__(
self,
queries_embeddings: list[list[int | float]],
k: int = 5,
Collaborator

Biggest nit of my life, but the default k for the other indexes is 10

If subset is provided (not yet implemented).

"""
if subset is not None:
Collaborator

I wonder if we should expose the param at all then; it's not exposed for Stanford PLAID, only for FastPlaid

pyproject.toml Outdated
"pytest-xdist >=3.6.0",
"pytest-rerunfailures >= 15.0.0",
"pytest >= 8.2.1",
"psutil >= 7.2.2",
Collaborator

Why do we need it in the main deps?

- Remove profiling/timing code (time.time, log_memory) from ScaNN index
- Remove psutil dependency from scann optional and dev deps
- Remove benchmark script from PR
- Switch doc_id_to_embedding_range storage from TSV to pickle
- Move _to_np_dtype helper to utils module as np_dtype_for
- Change default k from 5 to 10 to match other indexes
- Remove subset param from ScaNN.__call__ (not implemented)
- Add comment explaining NaN distances from ScaNN
- Add save/load round-trip test for pickle serialization
@robro612
Contributor Author

robro612 commented Mar 3, 2026

Addressing review comments

Thanks for the thorough review @NohTow! Here's what was addressed:

| Comment | Resolution |
| --- | --- |
| Remove profiling/timing code | Removed all time.time() profiling, log_memory() calls, and the _log_retrieve() method |
| Remove psutil dependency | Removed from both the scann optional deps and the dev deps |
| Remove benchmark script from PR | Removed (easy to recreate from the BEIR boilerplate if needed) |
| Use pickle instead of TSV for doc mapping | Switched doc_id_to_embedding_range serialization to pickle |
| Get metadata from the ScaNN object directly | Investigated: searcher.config() has the build-params protobuf but not our app-level flags (use_autopilot, store_embeddings), so the JSON sidecar is still needed, and desirable IMO for clear visibility without a load |
| Autopilot clarification | use_autopilot is a ScaNN config flag; when True it calls .autopilot() and ignores manual params. Handled correctly |
| Move nested function to utils | Extracted _to_np_dtype to np_dtype_for() in pylate/indexes/utils.py |
| remove_document filtering | Prefer not to patch behavior the underlying index doesn't offer |
| Default k=10 | Fixed to match the other indexes |
| subset param | Removed entirely (matches Voyager/StanfordPLAID, which also don't expose it) |
| Add NaN comment | Added an explanation that ScaNN can't always complete top-k |
| In-memory array for reconstructed embeddings | Kept as-is |

Also added a save/load round-trip test covering the pickle serialization.
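The NaN behavior noted above (ScaNN may return fewer than k neighbors, surfacing the missing slots as NaN distances) can be handled defensively as in this sketch. Pure NumPy; the -1 sentinel index is an assumption for illustration, not ScaNN's documented contract:

```python
import numpy as np

# Example search output where the index could not fill all k slots:
# missing entries show up as NaN distances (sentinel index assumed).
indices = np.array([7, 2, 9, -1], dtype=np.int64)
distances = np.array([0.91, 0.85, 0.40, np.nan], dtype=np.float32)

# Drop incomplete slots before passing candidates on to reranking.
valid = ~np.isnan(distances)
indices, distances = indices[valid], distances[valid]
# indices == [7, 2, 9]; only the three finite scores remain
```

Filtering before the rerank step keeps downstream scoring code free of NaN special-casing.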

Antoine Chaffin and others added 3 commits March 5, 2026 16:08
Better comment to not conflate our defaults with .autopilot() ScaNN feature

Co-authored-by: Antoine Chaffin <38869395+NohTow@users.noreply.github.com>
@NohTow NohTow merged commit d350714 into lightonai:main Mar 5, 2026
10 checks passed
@NohTow
Collaborator

NohTow commented Mar 5, 2026

Thanks for the PR!
