Make ColBERT prefix tokens optional for prefix-free models#1
Closed
Conversation
* v1 API server
* Changing model_name to model
* Add a readme
* Add docstring
* Adding the server in the documentation
* Add server to norecursedirs, add docstring for the endpoint method
* Enhancement of the documentation
Check if architectures is in the config before reading it
Make knowledge distillation processing compatible with DictDataset and Dataset.
…on about the parameter
* Use pytorch_model.bin to load stanford model instead of safetensor
* Bump ST version
* Pin transformers version
* Creation of the PylateModelCard
* Fixing ruff top import
* Removing the example making tests fail
* Changing Sentence Transformer model default to PyLate
* Moving files to a dedicated subfolder
* Removing all the awfully copy-pasted redundant code to extend ST properly
* Adding init for hf_hub
* Changing docstring for automatic parsing documentation
* Consistency in the model args
* Adding save to tests to test saving/model card creation
Add safetensor OR bin loading logic + add loading tests
* Initial working draft
* Adding PyLate similarity function (maxsim) and use it as the default for ColBERT models
* Remove default score function in NanoBEIR evaluator (not needed anymore)
* Remove hardcoded similarity function in the model card template
* Rename files
* Fix circular import and remove duplicate code
* Ruff formatting
* Remove examples including cosine
* Add model_card_template to setup
* Renaming mapping dicts
* Fixing docstrings, examples and extending NanoBEIREvaluator
* Documentation

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
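The MaxSim scoring added above as the default ColBERT similarity can be sketched generically. This is an illustrative numpy version, not PyLate's actual implementation: for each query token, take the maximum similarity over all document tokens, then sum over the query tokens.

```python
import numpy as np

def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
    """ColBERT-style MaxSim: per query token, the maximum similarity over
    all document tokens, summed over query tokens.

    query_embeddings: (num_query_tokens, dim), assumed L2-normalized
    document_embeddings: (num_doc_tokens, dim), assumed L2-normalized
    """
    # (num_query_tokens, num_doc_tokens) token-level similarity matrix
    similarities = query_embeddings @ document_embeddings.T
    # Max over document tokens, then sum over query tokens
    return float(similarities.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, dim 2
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
score = maxsim(q, d)  # 1.0 + 1.0 = 2.0
```

Because MaxSim is a sum of maxima rather than a single dot product, it is not a cosine similarity, which is why the hardcoded cosine defaults mentioned in these commits had to be removed.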
add `__future__` annotations & test python 3.9 to 3.12
…ation [Draft] Setting normalize_scores default to False
Revert "Merge pull request lightonai#66 from lightonai/default_to_no_normaliza…
* Bumping transformers and st versions
* Make set_widget_examples a no-op
* Removing np.float from NanoBEIR trace logs test
* Clean similarity_fn_name in model card
* Remove useless overriding of create_model_card
* Ruff formatting
* Move comment to a proper docstring
* Read stanford metadata
* Do not override attend_to_expansion_tokens to False by default if read from stanford
* Normalize the overriding for query/doc prefixes
* Update the comment about the behavior of reading/overriding
* Change some warnings to info
- Addition of NanoBEIREvaluator, giving quick signal about learning during training
- Support of Python 3.9
- Bump of transformers/ST versions, allowing ModernBERT to be used in PyLate and fixing an issue when loading models after training them with trust_remote_code=True
- Reading of Stanford-NLP model configurations (markers, attending to expansion tokens, ...), allowing models such as Jina-ColBERT to be loaded without having to specify all these parameters
- Some fixes
* remove print from index
* fix docstring
* Adding has_module
* Writing evaluator results to a json file
* Update dependency
* Bump ST version
* Overwrite get_model_type
* Somehow had to lint multiple times

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
…, else raise Error (lightonai#162)
* Add activation function back and load the ones of ST models
* Convert ST layers and add a layer if dimensions are not correct
* Fix typo in doc
* Add residual parameter
* Log residual in config dict
* What rebase do to a man
* Consistency

Co-authored-by: bclavie <bclavie@users.noreply.github.com>
* probably ruff
* Module doc
* Add MixedBread paper
* Fix boilerplate
* lint
* somehow add to run lint 3 times
* Update docs/models/models.md
* Update docs/models/models.md
* Update docs/models/models.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: Antoine Chaffin <38869395+NohTow@users.noreply.github.com>
* add check for missing parameters
* Update pylate/evaluation/beir.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* quantile fix
* lint fix
* routing quantile based on numel
* format correction

Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc706.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@cccxc704.pok.ibm.com>
Co-authored-by: Meet Doshi meet@ibm.com <meet@ccc-login3.pok.ibm.com>
* Handle fast-plaid id offset when deleting
* Delete folder when overriding in fast-plaid
* Remove a folder via shutil
* Add test for fast-plaid deletion
* lint
* Up n_ivf_probe to make sure we catch all of the documents in edge cases

Co-authored-by: Antoine Chaffin <antoine.chaffin@lighton.ai>
Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
* Add base boilerplates
* Ruff
* Update collator
* Add missing functions to collator
* Update pylate/utils/collator.py
* Better defaults
* pass task to tokenize
* Add task as an arg of tokenize

Co-authored-by: Antoine Chaffin <antoine.chaffin@icloud.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This adds ScaNN index support with fp16/fp32-preserving flattening, shared index utilities, scann extras, and focused tests for verbosity, dtype handling, and embedding retrieval behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
…s test
- Run ruff format/lint on scann.py and test_scann.py
- Replace per-token Python loop in __call__() result construction with np.split for vectorized reshaping
- Add test for numpy document embedding input (float16/float32)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
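The np.split change mentioned above can be illustrated generically: variable-length per-document token embeddings are flattened into one dtype-preserving array, and the per-document views are rebuilt in a single vectorized call instead of a per-token Python loop. The variable names here are illustrative, not the actual scann.py code:

```python
import numpy as np

# Two documents with different token counts, kept in float16.
docs = [
    np.ones((3, 4), dtype=np.float16),
    np.zeros((5, 4), dtype=np.float16),
]
lengths = [doc.shape[0] for doc in docs]

# Flatten into a single (total_tokens, dim) array; the dtype is preserved.
flat = np.concatenate(docs, axis=0)

# Rebuild per-document views with one vectorized call instead of a loop.
boundaries = np.cumsum(lengths)[:-1]  # split points between documents
rebuilt = np.split(flat, boundaries, axis=0)
```

`np.split` returns views into the flat array, so reconstruction costs no extra copies and the float16/float32 dtype mentioned in the tests is preserved end to end.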
…efix*

Some late-interaction models nowadays (e.g. XTR) don't use Q/D prefix tokens. This changes the model initialization so that it no longer strongly defaults to adding [Q] and [D] prefix tokens. This is probably an opinionated and *breaking* change for some use cases (e.g. if users assume that training initialization will add the prefix tokens). An alternative approach would be to use the currently inert `add_special_tokens` argument, which does not affect the logic at all today.
Summary
- When `query_prefix` or `document_prefix` is `None`, no default marker tokens (`[Q]`/`[D]`) are forced
- Prefix insertion in `tokenize()` is skipped entirely when no prefix is configured
- `max_seq_length` is no longer reduced by 1 when there's no prefix to insert

This enables loading models (e.g. XTR) that don't use ColBERT-style marker tokens without incorrect tokenization.
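The three behaviors can be sketched as follows; the helper names are hypothetical stand-ins for the model's tokenization path, not the actual PyLate API:

```python
from typing import Optional

def effective_max_seq_length(max_seq_length: int, prefix: Optional[str]) -> int:
    # A slot for the marker token is reserved only when a prefix is configured.
    return max_seq_length - 1 if prefix is not None else max_seq_length

def insert_prefix_token(token_ids: list, prefix_id: Optional[int]) -> list:
    # No prefix configured: the input ids pass through untouched.
    if prefix_id is None:
        return token_ids
    # Otherwise insert the marker id after the leading [CLS]-style token.
    return [token_ids[0], prefix_id, *token_ids[1:]]

# A None prefix leaves both the length budget and the ids unchanged.
assert effective_max_seq_length(512, None) == 512
assert effective_max_seq_length(512, "[Q] ") == 511
assert insert_prefix_token([101, 2054, 102], None) == [101, 2054, 102]
assert insert_prefix_token([101, 2054, 102], 1) == [101, 1, 2054, 102]
```

The key point is that `None` is treated as "no marker at all" rather than falling back to a default `[Q]`/`[D]` token, so prefix-free models see exactly the token sequence their tokenizer produced.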
Depends on lightonai#195
Test plan
- `query_prefix=None, document_prefix=None` skips prefix insertion
- `python -m pytest tests/ -q` for regression
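The first check can be written as a small parametrized test. `tokenize_with_prefix` below is a hypothetical stand-in for the tokenization path under test, not a real PyLate function; it only illustrates the pattern the test plan describes:

```python
import pytest

def tokenize_with_prefix(tokens, prefix_token=None):
    # Hypothetical stand-in: a configured prefix token is inserted at the
    # front, while None means no marker is added at all.
    if prefix_token is None:
        return list(tokens)
    return [prefix_token, *tokens]

@pytest.mark.parametrize(
    ("prefix_token", "expected"),
    [
        ("[Q]", ["[Q]", "what", "is", "maxsim"]),  # ColBERT-style marker
        (None, ["what", "is", "maxsim"]),          # prefix-free, e.g. XTR
    ],
)
def test_prefix_insertion(prefix_token, expected):
    assert tokenize_with_prefix(["what", "is", "maxsim"], prefix_token) == expected
```

Parametrizing over both the marker and the `None` case keeps the old and new behaviors covered by the same assertion.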