About PlaceHolder Token <PH> #8
Hello,
I found that you add a placeholder token to the tokenizer with the following code:

```python
tokenizer._add_tokens(["<PH>"], special_tokens=True)
tokenizer.placeholder_token = "<PH>"
```
The placeholder token is then used in the following code:

```python
encoded_ph = tokenizer.convert_tokens_to_ids(tokenizer.placeholder_token)
if len(truncated_rewrite) > len(truncated_query):
    truncated_query += [encoded_ph] * (len(truncated_rewrite) - len(truncated_query))
else:
    truncated_rewrite += [encoded_ph] * (len(truncated_query) - len(truncated_rewrite))
```
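For reference, the padding step above can be reproduced standalone; the token ids below are illustrative, not the ids the real tokenizer would produce:

```python
# Standalone sketch of the padding step above.
# All ids are illustrative; encoded_ph stands in for the id assigned to <PH>.
encoded_ph = 50000

truncated_query = [12, 345, 67]           # 3 token ids
truncated_rewrite = [12, 345, 67, 89, 9]  # 5 token ids

# Pad the shorter sequence with the placeholder id until lengths match.
if len(truncated_rewrite) > len(truncated_query):
    truncated_query += [encoded_ph] * (len(truncated_rewrite) - len(truncated_query))
else:
    truncated_rewrite += [encoded_ph] * (len(truncated_query) - len(truncated_rewrite))

print(truncated_query)  # → [12, 345, 67, 50000, 50000]
```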
However, the index of this placeholder token exceeds the size of the pre-trained vocabulary, so there is no embedding for it in the embedding table. How can this be solved? Should the placeholder token be replaced with an existing token from the vocabulary, and if so, which one?
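For what it's worth, with a Hugging Face-style model the usual remedy is to grow the embedding matrix rather than reuse an existing token, i.e. call `model.resize_token_embeddings(len(tokenizer))` after adding `<PH>`. A minimal NumPy sketch of what that resize amounts to (the sizes and the mean-of-existing-rows initialization are illustrative assumptions, not this repo's code):

```python
import numpy as np

def resize_embeddings(emb: np.ndarray, new_vocab_size: int) -> np.ndarray:
    """Grow an embedding table to new_vocab_size rows.

    New rows are initialized to the mean of the existing embeddings,
    a common heuristic for freshly added special tokens.
    """
    old_vocab_size, dim = emb.shape
    if new_vocab_size <= old_vocab_size:
        return emb
    new_rows = np.tile(emb.mean(axis=0), (new_vocab_size - old_vocab_size, 1))
    return np.vstack([emb, new_rows])

# Illustrative sizes: pre-trained vocab of 8 tokens, embedding dim 4,
# plus one appended <PH> token.
emb = np.random.randn(8, 4)
resized = resize_embeddings(emb, 9)
ph_id = 8              # id that convert_tokens_to_ids gives the appended token
ph_vector = resized[ph_id]  # now a valid lookup instead of an out-of-range index
```

After resizing, the new row is trainable like any other, so `<PH>` picks up a useful representation during fine-tuning instead of colliding with an existing token's meaning.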