Read and use tokenizer identifier from config #611

Merged
2015aroras merged 1 commit into main from shanea/hf-get-tokenizer-from-config-2
Jun 10, 2024
Conversation

@2015aroras
Collaborator

Correction of #610, tested on a 1.7 7B checkpoint.

@2015aroras requested a review from epwalsh on June 10, 2024 19:11
Comment on lines +206 to +213
config_path = Path(checkpoint_dir) / "config.yaml"
tokenizer_config = yaml.safe_load(config_path.read_text())["tokenizer"]

# Initialize tokenizer and validate vocab size.
if Path(tokenizer_config["identifier"]).is_file():
    base_tokenizer = Tokenizer.from_file(tokenizer_config["identifier"])
else:
    base_tokenizer = Tokenizer.from_pretrained(tokenizer_config["identifier"])

Member

Any reason not to do this:

Suggested change
- config_path = Path(checkpoint_dir) / "config.yaml"
- tokenizer_config = yaml.safe_load(config_path.read_text())["tokenizer"]
- # Initialize tokenizer and validate vocab size.
- if Path(tokenizer_config["identifier"]).is_file():
-     base_tokenizer = Tokenizer.from_file(tokenizer_config["identifier"])
- else:
-     base_tokenizer = Tokenizer.from_pretrained(tokenizer_config["identifier"])
+ base_tokenizer = Tokenizer.from_checkpoint(checkpoint_dir)

Collaborator Author

This is Tokenizer from the tokenizers library, not the one from the OLMo codebase. I had forgotten that when I made my first PR.
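
For readers following the thread, here is a minimal sketch of the loading logic under review, written so the distinction between the two Tokenizer classes is explicit in the import. The function names `tokenizer_source` and `load_base_tokenizer` are hypothetical and do not appear in the PR; the sketch assumes `config.yaml` has a `tokenizer.identifier` entry, as the diff suggests, and that `pyyaml` and `tokenizers` are installed when the loader is called.

```python
from pathlib import Path
from typing import Any, Dict, Tuple


def tokenizer_source(tokenizer_config: Dict[str, Any]) -> Tuple[str, str]:
    """Classify the configured identifier as a local file or a Hub name.

    Hypothetical helper: it mirrors the is_file() branch in the diff.
    """
    identifier = tokenizer_config["identifier"]
    if Path(identifier).is_file():
        return "file", identifier
    return "hub", identifier


def load_base_tokenizer(checkpoint_dir: str):
    """Load a Hugging Face tokenizer named in <checkpoint_dir>/config.yaml."""
    # Imported lazily so tokenizer_source() works without these packages.
    import yaml
    # This is the Hugging Face `tokenizers` class, not OLMo's own Tokenizer.
    from tokenizers import Tokenizer

    config_path = Path(checkpoint_dir) / "config.yaml"
    tokenizer_config = yaml.safe_load(config_path.read_text())["tokenizer"]

    kind, identifier = tokenizer_source(tokenizer_config)
    if kind == "file":
        # Identifier is a path to a serialized tokenizer JSON file.
        return Tokenizer.from_file(identifier)
    # Otherwise treat it as a model identifier on the Hugging Face Hub.
    return Tokenizer.from_pretrained(identifier)
```

Splitting the file-versus-Hub decision into its own function keeps that branch testable without network access or a real checkpoint directory.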

Member

Ah, ok

@2015aroras merged commit 578234d into main on Jun 10, 2024
@2015aroras deleted the shanea/hf-get-tokenizer-from-config-2 branch on June 10, 2024 23:13