Fix a bug in the Transformers tokenizer by RobinPicard · Pull Request #1817 · dottxt-ai/outlines

RobinPicard · 2026-02-05T15:35:13Z

The problem identified in the issue above comes from inconsistent handling of tokens containing a spiece underline. The TransformerTokenizer class was adding a whitespace to their string representation only for LlamaTokenizer tokenizers, but some models do not use that tokenizer class and still rely on SentencePiece tokenization, which requires this transformation.

We modify the convert_token_to_string method of TransformerTokenizer to add the whitespace for tokens starting with a spiece underline for all tokenizers. Tokenizers that do not need it are unaffected, because they do not use the spiece underline and so none of their tokens match.
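For readers who want to see the shape of the change, here is a minimal, self-contained sketch of the behavior described above. It is not the actual diff from this PR: the standalone function signature, the `decode` parameter, and the two fake decoders are hypothetical stand-ins so the example runs without `transformers` installed, and it assumes the whitespace is prepended to the decoded string. The spiece underline is taken to be the character U+2581.

```python
from typing import Callable, List

# The SentencePiece word-boundary marker (assumed here to be U+2581).
SPIECE_UNDERLINE = "\u2581"


def convert_token_to_string(token: str, decode: Callable[[List[str]], str]) -> str:
    """Sketch of the fixed behavior: no tokenizer-class check at all.

    `decode` is a hypothetical stand-in for the underlying tokenizer's
    tokens-to-string conversion.
    """
    string = decode([token])
    # Applied to every tokenizer: only tokens that actually start with the
    # spiece underline are affected, so other vocabularies are untouched.
    if token.startswith(SPIECE_UNDERLINE):
        return " " + string
    return string


def fake_sentencepiece_decode(tokens: List[str]) -> str:
    # Hypothetical decoder that mimics losing the leading space when a
    # single SentencePiece token is decoded on its own.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").lstrip(" ")


def fake_plain_decode(tokens: List[str]) -> str:
    # Hypothetical decoder for a vocabulary without the spiece underline.
    return "".join(tokens)


if __name__ == "__main__":
    print(repr(convert_token_to_string(SPIECE_UNDERLINE + "hello", fake_sentencepiece_decode)))
    print(repr(convert_token_to_string("hello", fake_plain_decode)))
```

The design point is that the condition depends only on the token itself, so there is no need to detect which tokenizer class or model family is in use.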

SauravMaheshkar

Looks good as a fix. Maybe we could double-check that the bug is fixed by running the script from the original issue #1816?

RobinPicard · 2026-02-06T07:28:16Z

Yes, it works as expected with the fix

Fix a bug in the Transformers tokenizer

92977a2

RobinPicard requested a review from SauravMaheshkar February 5, 2026 15:48

SauravMaheshkar reviewed Feb 5, 2026

SauravMaheshkar added the tokenization label Feb 5, 2026

RobinPicard merged commit 1ea2b18 into main Feb 6, 2026
6 checks passed

RobinPicard deleted the fix_bug_transformers_tokenizer branch February 6, 2026 07:28

RobinPicard merged 1 commit into main from fix_bug_transformers_tokenizer


2 participants
