❓ The question
In the file prepare_memmap_dataset.py, at lines 244 and 456, the following code snippet is found:
```python
tokenizer = Tokenizer.from_pretrained(tokenizer_id, truncate_to=None)
```
This call passes no explicit eos_token_id argument. The from_pretrained method, defined below, then falls back to using the last token ID in the vocabulary (get_vocab_size() - 1) as the eos_token_id:
```python
def from_pretrained(cls, identifier: str, **kwargs) -> Tokenizer:
    base_tokenizer = BaseTokenizer.from_pretrained(identifier)
    eos_token_id = kwargs.pop("eos_token_id", base_tokenizer.get_vocab_size() - 1)
    return cls(base_tokenizer, eos_token_id, **kwargs)
```
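To make the fallback concrete, here is a minimal sketch of the `kwargs.pop` default in isolation. `FakeBaseTokenizer` and `resolve_eos_token_id` are hypothetical stand-ins for illustration, not the actual olmo classes; only the fallback expression itself is taken from the snippet above.

```python
class FakeBaseTokenizer:
    """Hypothetical stand-in exposing only get_vocab_size()."""

    def __init__(self, vocab_size: int) -> None:
        self._vocab_size = vocab_size

    def get_vocab_size(self) -> int:
        return self._vocab_size


def resolve_eos_token_id(base_tokenizer, **kwargs) -> int:
    # Mirrors the snippet: an explicit eos_token_id kwarg wins,
    # otherwise the last vocabulary index is used as the default.
    return kwargs.pop("eos_token_id", base_tokenizer.get_vocab_size() - 1)


base = FakeBaseTokenizer(vocab_size=50280)
print(resolve_eos_token_id(base))                   # falls back to 50279
print(resolve_eos_token_id(base, eos_token_id=0))   # explicit id wins: 0
```

The point of the question follows from this: if the tokenizer's real EOS token does not happen to sit at the last vocabulary index, omitting the eos_token_id argument silently picks the wrong ID.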