Make ColBERT prefix tokens optional for prefix-free models#1

Closed
robro612 wants to merge 89 commits into main from colbert-optional-prefix

Conversation

@robro612
Owner

@robro612 robro612 commented Feb 18, 2026

Summary

  • When query_prefix or document_prefix is None, no default marker tokens ([Q] / [D]) are forced
  • Prefix insertion in tokenize() is skipped entirely when no prefix is configured
  • max_seq_length is no longer reduced by 1 when there's no prefix to insert
  • Token embedding resizing only runs when prefixes are actually defined

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.

Note: The existing add_special_tokens constructor parameter was already a no-op (declared but never wired); this PR makes prefix behavior actually configurable via the existing query_prefix/document_prefix parameters.
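The conditional behavior in the first three bullets can be sketched with a toy, self-contained function (this is illustrative only, not PyLate's actual tokenize() API; `insert_prefix` and its parameters are hypothetical names):

```python
# Illustrative sketch of the prefix-optional logic this PR describes:
# when no prefix token id is configured, nothing is inserted and the
# full sequence budget is kept. Not PyLate's real API.
def insert_prefix(input_ids, prefix_id=None, max_seq_length=8):
    """Insert a marker token after position 0 ([CLS]) only when configured."""
    if prefix_id is None:
        # Prefix-free model (e.g. XTR): ids pass through, full budget kept.
        return input_ids[:max_seq_length]
    # Reserve one slot for the marker, then insert it after [CLS].
    ids = input_ids[: max_seq_length - 1]
    return [ids[0], prefix_id] + ids[1:]
```

With `prefix_id=None` the sequence is untouched; with a marker id configured, one slot is reserved and the marker lands right after the first token, mirroring the max_seq_length-minus-1 behavior described above.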

Depends on lightonai#195

Test plan

  • Verify existing ColBERT models with prefixes still encode identically
  • Verify models loaded with query_prefix=None, document_prefix=None skip prefix insertion
  • Run python -m pytest tests/ -q for regression
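The parity check in the first bullet could be expressed as a standalone toy test (token ids and helper names are hypothetical, standing in for the real tokenize() path):

```python
# Toy regression check mirroring the test plan: with a prefix configured,
# the new code must match the old always-insert path; with None, ids pass
# through with the full budget. Illustrative only, not PyLate's API.
def tokenize_ids(ids, prefix_id=None, max_len=8):
    # Post-PR behavior: marker and budget reduction are conditional.
    if prefix_id is None:
        return ids[:max_len]
    kept = ids[: max_len - 1]
    return [kept[0], prefix_id] + kept[1:]

def legacy_tokenize_ids(ids, prefix_id, max_len=8):
    # Pre-PR behavior: marker always inserted, budget always reduced by 1.
    kept = ids[: max_len - 1]
    return [kept[0], prefix_id] + kept[1:]
```

When a prefix is set, both paths produce identical ids, which is what "encode identically" in the test plan requires; when it is None, the full sequence budget survives.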

NohTow and others added 30 commits October 2, 2024 15:08
* v1 API server

* Changing model_name to model

* Add a readme

* Add docstring

* Adding the server in the documentation

* Add server to norecursedirs, add docstring for the endpoint method

* Enhancement of the documentation

Check if architectures is in the config before reading it

Make knowledge distillation processing compatible with DictDataset and Dataset.
* Use pytorch_model.bin to load stanford model instead of safetensor

* Bump ST version

* pin transformers version
* Creation of the PylateModelCard

* Fixing ruff top import

* Removing the example making tests fail

* Changing Sentence Transformer model default to PyLate

* Moving files to a dedicated subfolder

* Removing all the awfully copy-pasted redundant code to extend ST properly

* Adding init for hf_hub

* Changing docstring for automatic parsing documentation

* Consistency in the model args

* Adding save to tests to test saving/model card creation

Add safetensor OR bin loading logic + add loading tests
* Initial working draft

* Adding PyLate similarity function (maxsim) and use it as the default for ColBERT models

* Remove default score function in NanoBEIR evaluator (not needed anymore)

* Remove hardcoded similarity function in the model card template

* Rename files

* fix circular import and remove duplicate code

* Ruff formating

* Remove examples including cosine

* Add model_card_template to setup

* Renaming mapping dicts

* Fixing docstrings, examples and extends NanoBEIREvaluator

* Documentation

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
add `__future__` annotations & test python 3.9 to 3.12
…ation

[Draft] Setting normalize_scores default to False
…normalization"

This reverts commit 4d1d410, reversing
changes made to e349918.
Revert "Merge pull request lightonai#66 from lightonai/default_to_no_normaliza…
* Bumping transformers and st versions

* Make set_widget_examples a no-op

* Removing np.float from NanoBEIR trace logs test

* Clean similarity_fn_name in model card

* Remove create_model_card useless overriding

* Ruff formating

* Move comment to a proper docstring
* Read stanford metadata

* Do not override attend_to_expansion_tokens to False by default if read from stanford

* Normalize the overriding for query/doc prefixes

* Update the comment about the behavior of reading/overriding

* Change some warnings to info
- Addition of NanoBEIREvaluator, allowing to give quick signal about the learning during training

- Support of Python 3.9

- Bump of transformers/ST versions, allowing to use ModernBERT in PyLate and also fixing an issue for loading models after training them with trust_remote_code=True

- Reading of Stanford-NLP models configurations (markers, attending to expansion tokens, ...), allowing to load models such as Jina-ColBERT without having to specify all these parameters

Some fixes
Samoed and others added 27 commits September 23, 2025 15:23
* remove print from index

* fix docstring
* Adding has_module

* Writing evaluator results to a json file

* Update dependency

* Bump ST version

* Overwrite get_model_type

* Somehow had to lint multiple time

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
* Add activation function back and load the ones of ST models

* Convert ST layers and add a layer if dimensions are not correct

* typo in doc
* Add residual parameter

Co-authored-by: bclavie <bclavie@users.noreply.github.com>

* Log residual in config dict

Co-authored-by: bclavie <bclavie@users.noreply.github.com>

* What rebase do to a man

* Consistency

---------

Co-authored-by: bclavie <bclavie@users.noreply.github.com>
* probably ruff

* Module doc

* Add MixedBread paper

* Fix boilerplate

* lint

* somehow add to run lint 3 times

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: Antoine Chaffin <38869395+NohTow@users.noreply.github.com>
* add check for missing parameters

* Update pylate/evaluation/beir.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* quantile fix

* lint fix

* routing quantile based on numel

* format correction

---------

Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc706.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc704.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@ccc-login3.pok.ibm.com>
* Handle fast-plaid id offset when deleting

* Delete folder when overriding in fast-plaid

* remove a folder is shutil

* Add test for fast-plaid deletion

* lint

* Up n_ivf_probe to make sure we catch all of the documents in edge cases

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
* Add base boilerplates

* Ruff
* Update collator

* Add missing functions to collator

* Update pylate/utils/collator.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Better defaults

* pass task to tokenize

* Add task as an arg of tokenize

---------

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This adds ScaNN index support with fp16/fp32-preserving flattening, shared index utilities, scann extras, and focused tests for verbosity, dtype handling, and embedding retrieval behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
…s test

- Run ruff format/lint on scann.py and test_scann.py
- Replace per-token Python loop in __call__() result construction with
  np.split for vectorized reshaping
- Add test for numpy document embedding input (float16/float32)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efix*

Some LI models nowadays (e.g. XTR) don't use Q/D prefix tokens. This changes the behavior of the model initialization to no longer strongly default to adding [Q] and [D] prefix tokens.

This is probably an opinionated and *breaking* change for some people's use case (if they assume e.g. during training initialization it will add the prefix tokens). Another method might be to use the currently inert `add_special_tokens` argument which currently does not impact the logic at all.
@robro612 robro612 closed this Feb 18, 2026
@robro612 robro612 deleted the colbert-optional-prefix branch February 18, 2026 19:39
@robro612 robro612 restored the colbert-optional-prefix branch February 18, 2026 19:39