
Fix a bug in the Transformers tokenizer#1817

Merged
RobinPicard merged 1 commit into main from fix_bug_transformers_tokenizer
Feb 6, 2026
Conversation

@RobinPicard
Contributor

Closes #1816

The problem identified in the issue above comes from inconsistent handling of tokens containing a spiece underline. The TransformerTokenizer class was adding a white space to the string representation of such tokens only for LlamaTokenizer tokenizers, but some models do not use LlamaTokenizer and still rely on SentencePiece tokenization, which requires this transformation.

We modify the convert_token_to_string method of TransformerTokenizer to prepend a white space to tokens starting with a spiece underline for all tokenizers. Nothing changes for tokenizers that do not need the transformation, since they never emit the spiece underline.
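The change can be sketched as follows. This is a minimal, simplified illustration rather than the actual Outlines code: `decode_token` here is a hypothetical stand-in for the underlying tokenizer's decoding step, which drops the leading space that the spiece underline (U+2581) encodes.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece marker for a word-initial space


def decode_token(token: str) -> str:
    # Stand-in for the underlying tokenizer's convert_tokens_to_string,
    # which loses the leading space encoded by the spiece underline.
    return token.replace(SPIECE_UNDERLINE, " ").lstrip()


def convert_token_to_string(token: str) -> str:
    string = decode_token(token)
    # The fix: prepend the white space whenever the raw token starts with
    # the spiece underline, for every tokenizer rather than only
    # LlamaTokenizer. Tokenizers that never emit the underline are
    # unaffected.
    if token.startswith(SPIECE_UNDERLINE):
        string = " " + string
    return string
```

With this guard, `"\u2581Hello"` maps to `" Hello"` instead of `"Hello"`, which is what prevented spaces from being dropped in the generated output.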

@SauravMaheshkar
Member

Looks good as a fix; maybe we could double-check that the bug is fixed by running the script from the original issue #1816?

@RobinPicard
Contributor Author

Yes, it works as expected with the fix.

@RobinPicard RobinPicard merged commit 1ea2b18 into main Feb 6, 2026
6 checks passed
@RobinPicard RobinPicard deleted the fix_bug_transformers_tokenizer branch February 6, 2026 07:28


Development

Successfully merging this pull request may close these issues.

Transformers backend seems to produce malformed JSON output with spaces in tokens
