Make ColBERT prefix tokens optional for prefix-free models#1

Closed
robro612 wants to merge 89 commits into main from colbert-optional-prefix

Conversation

@robro612
Owner

@robro612 robro612 commented Feb 18, 2026

Summary

  • When query_prefix or document_prefix is None, no default marker tokens ([Q] / [D]) are forced
  • Prefix insertion in tokenize() is skipped entirely when no prefix is configured
  • max_seq_length is no longer reduced by 1 when there's no prefix to insert
  • Token embedding resizing only runs when prefixes are actually defined

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.

Note: The existing add_special_tokens constructor parameter was already a no-op (declared but never wired); this PR makes prefix behavior actually configurable via the existing query_prefix/document_prefix parameters.
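The conditional behavior in the first three bullets can be sketched with a toy, self-contained function (this is illustrative only, not PyLate's actual tokenize() API; `insert_prefix` and its parameters are hypothetical names):

```python
# Illustrative sketch of the prefix-optional logic this PR describes:
# when no prefix token id is configured, nothing is inserted and the
# full sequence budget is kept. Not PyLate's real API.
def insert_prefix(input_ids, prefix_id=None, max_seq_length=8):
    """Insert a marker token after position 0 ([CLS]) only when configured."""
    if prefix_id is None:
        # Prefix-free model (e.g. XTR): ids pass through, full budget kept.
        return input_ids[:max_seq_length]
    # Reserve one slot for the marker, then insert it after [CLS].
    ids = input_ids[: max_seq_length - 1]
    return [ids[0], prefix_id] + ids[1:]
```

With `prefix_id=None` the sequence is untouched; with a marker id configured, one slot is reserved and the marker lands right after the first token, mirroring the max_seq_length-minus-1 behavior described above.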

Depends on lightonai#195

Test plan

  • Verify existing ColBERT models with prefixes still encode identically
  • Verify models loaded with query_prefix=None, document_prefix=None skip prefix insertion
  • Run python -m pytest tests/ -q for regression
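The parity check in the first bullet could be expressed as a standalone toy test (token ids and helper names are hypothetical, standing in for the real tokenize() path):

```python
# Toy regression check mirroring the test plan: with a prefix configured,
# the new code must match the old always-insert path; with None, ids pass
# through with the full budget. Illustrative only, not PyLate's API.
def tokenize_ids(ids, prefix_id=None, max_len=8):
    # Post-PR behavior: marker and budget reduction are conditional.
    if prefix_id is None:
        return ids[:max_len]
    kept = ids[: max_len - 1]
    return [kept[0], prefix_id] + kept[1:]

def legacy_tokenize_ids(ids, prefix_id, max_len=8):
    # Pre-PR behavior: marker always inserted, budget always reduced by 1.
    kept = ids[: max_len - 1]
    return [kept[0], prefix_id] + kept[1:]
```

When a prefix is set, both paths produce identical ids, which is what "encode identically" in the test plan requires; when it is None, the full sequence budget survives.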

NohTow and others added 30 commits October 2, 2024 15:08
* v1 API server

* Changing model_name to model

* Add a readme

* Add docstring

* Adding the server in the documentation

* Add server to norecursedirs, add docstring for the endpoint method

* Enhancement of the documentation

Check if architectures is in the config before reading it

Make knowledge distillation processing compatible with DictDataset and Dataset.
* Use pytorch_model.bin to load stanford model instead of safetensor

* Bump ST version

* pin transformers version
* Creation of the PylateModelCard

* Fixing ruff top import

* Removing the example making tests fail

* Changing Sentence Transformer model default to PyLate

* Moving files to a dedicated subfolder

* Removing all the awfully copy-pasted redundant code to extend ST properly

* Adding init for hf_hub

* Changing docstring for automatic parsing documentation

* Consistency in the model args

* Adding save to tests to test saving/model card creation

Add safetensor OR bin loading logic + add loading tests
* Initial working draft

* Adding PyLate similarity function (maxsim) and use it as the default for ColBERT models

* Remove default score function in NanoBEIR evaluator (not needed anymore)

* Remove hardcoded similarity function in the model card template

* Rename files

* fix circular import and remove duplicate code

* Ruff formating

* Remove examples including cosine

* Add model_card_template to setup

* Renaming mapping dicts

* Fixing docstrings, examples and extends NanoBEIREvaluator

* Documentation

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
add `__future__` annotations & test python 3.9 to 3.12
…ation

[Draft] Setting normalize_scores default to False
…normalization"

This reverts commit 4d1d410, reversing
changes made to e349918.
Revert "Merge pull request lightonai#66 from lightonai/default_to_no_normaliza…
* Bumping transformers and st versions

* Make set_widget_examples a no-op

* Removing np.float from NanoBEIR trace logs test

* Clean similarity_fn_name in model card

* Remove create_model_card useless overriding

* Ruff formating

* Move comment to a proper docstring
* Read stanford metadata

* Do not override attend_to_expansion_tokens to False by default if read from stanford

* Normalize the overriding for query/doc prefixes

* Update the comment about the behavior of reading/overriding

* Change some warnings to info
- Addition of NanoBEIREvaluator, allowing to give quick signal about the learning during training

- Support of Python 3.9

- Bump of transformers/ST versions, allowing to use ModernBERT in PyLate and also fixing an issue for loading models after training them with trust_remote_code=True

- Reading of Stanford-NLP models configurations (markers, attending to expansion tokens, ...), allowing to load models such as Jina-ColBERT without having to specify all these parameters

Some fixes
Samoed and others added 27 commits September 23, 2025 15:23
* remove print from index

* fix docstring
* Adding has_module

* Writing evaluator results to a json file

* Update dependency

* Bump ST version

* Overwrite get_model_type

* Somehow had to lint multiple time

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
* Add activation function back and load the ones of ST models

* Convert ST layers and add a layer if dimensions are not correct

* typo in doc
* Add residual parameter

Co-authored-by: bclavie <bclavie@users.noreply.github.com>

* Log residual in config dict

Co-authored-by: bclavie <bclavie@users.noreply.github.com>

* What rebase do to a man

* Consistency

---------

Co-authored-by: bclavie <bclavie@users.noreply.github.com>
* probably ruff

* Module doc

* Add MixedBread paper

* Fix boilerplate

* lint

* somehow add to run lint 3 times

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: Antoine Chaffin <38869395+NohTow@users.noreply.github.com>
* add check for missing parameters

* Update pylate/evaluation/beir.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* quantile fix

* lint fix

* routing quantile based on numel

* format correction

---------

Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc706.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc704.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@ccc-login3.pok.ibm.com>
* Handle fast-plaid id offset when deleting

* Delete folder when overriding in fast-plaid

* remove a folder is shutil

* Add test for fast-plaid deletion

* lint

* Up n_ivf_probe to make sure we catch all of the documents in edge cases

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
* Add base boilerplates

* Ruff
* Update collator

* Add missing functions to collator

* Update pylate/utils/collator.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Better defaults

* pass task to tokenize

* Add task as an arg of tokenize

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This adds ScaNN index support with fp16/fp32-preserving flattening, shared index utilities, scann extras, and focused tests for verbosity, dtype handling, and embedding retrieval behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
…s test

- Run ruff format/lint on scann.py and test_scann.py
- Replace per-token Python loop in __call__() result construction with
  np.split for vectorized reshaping
- Add test for numpy document embedding input (float16/float32)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efix*

Some LI models nowadays (e.g. XTR) don't use Q/D prefix tokens. This changes the behavior of the model initialization to no longer strongly default to adding [Q] and [D] prefix tokens.

This is probably an opinionated and *breaking* change for some people's use case (if they assume e.g. during training initialization it will add the prefix tokens). Another method might be to use the currently inert `add_special_tokens` argument which currently does not impact the logic at all.
@robro612 robro612 closed this Feb 18, 2026
@robro612 robro612 deleted the colbert-optional-prefix branch February 18, 2026 19:39
@robro612 robro612 restored the colbert-optional-prefix branch February 18, 2026 19:39