
prepare_memmap_dataset.py seems to use wrong eos_token_id for the tokenizer #513

@wsonejoy

Description


❓ The question

In the file prepare_memmap_dataset.py, at lines 244 and 456, the following code snippet appears:

tokenizer = Tokenizer.from_pretrained(tokenizer_id, truncate_to=None)

This call omits the eos_token_id argument. The from_pretrained method, defined as follows, therefore falls back to the last id in the vocabulary as the eos_token_id:

@classmethod
def from_pretrained(cls, identifier: str, **kwargs) -> Tokenizer:
    base_tokenizer = BaseTokenizer.from_pretrained(identifier)
    # Default: last id in the vocabulary, unless eos_token_id is passed explicitly.
    eos_token_id = kwargs.pop("eos_token_id", base_tokenizer.get_vocab_size() - 1)
    return cls(base_tokenizer, eos_token_id, **kwargs)
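To illustrate the concern, here is a minimal, self-contained sketch (using toy stand-in classes, not the actual OLMo implementation) showing how the `get_vocab_size() - 1` fallback can pick a token that is not the real EOS token, e.g. for vocabularies that append padding tokens after the special tokens:

```python
class BaseTokenizer:
    """Toy stand-in for the underlying HF tokenizer with a fixed vocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def get_vocab_size(self):
        return len(self.vocab)

    def token_to_id(self, token):
        return self.vocab.index(token)


class Tokenizer:
    def __init__(self, base_tokenizer, eos_token_id):
        self.base_tokenizer = base_tokenizer
        self.eos_token_id = eos_token_id

    @classmethod
    def from_pretrained(cls, base_tokenizer, **kwargs):
        # Same default as the snippet above: last id in the vocabulary.
        eos_token_id = kwargs.pop("eos_token_id", base_tokenizer.get_vocab_size() - 1)
        return cls(base_tokenizer, eos_token_id, **kwargs)


# A vocab where EOS is *not* the last entry (hypothetical, but similar to
# vocabs that append extra padding tokens after the special tokens).
vocab = ["<eos>", "hello", "world", "<pad_0>", "<pad_1>"]
base = BaseTokenizer(vocab)

default = Tokenizer.from_pretrained(base)
explicit = Tokenizer.from_pretrained(base, eos_token_id=base.token_to_id("<eos>"))

print(default.eos_token_id)   # 4 -- points at "<pad_1>", not "<eos>"
print(explicit.eos_token_id)  # 0 -- the actual EOS id
```

So unless the last vocabulary entry happens to be the EOS token for the tokenizer in question, it seems safer for the script to pass eos_token_id explicitly.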

Metadata

Labels

type/question (an issue that's a question)
