About PlaceHolder Token <PH> #8

@xyltt

Description

Hello,

I found that you add a placeholder token to the tokenizer using the following code:

  tokenizer._add_tokens(["<PH>"], special_tokens=True)
  tokenizer.placeholder_token = "<PH>"

And the placeholder token is then used in the following code:

    encoded_ph = tokenizer.convert_tokens_to_ids(tokenizer.placeholder_token)
    
    if len(truncated_rewrite) > len(truncated_query):
        truncated_query   += [encoded_ph] * (len(truncated_rewrite) - len(truncated_query))
    else:
        truncated_rewrite += [encoded_ph] * (len(truncated_query) - len(truncated_rewrite))

However, the index of this placeholder token exceeds the size of the pre-trained vocabulary, so there is no embedding for this token in the pre-trained embedding table. How can this be solved? Should the placeholder token be replaced with an existing token from the vocab? If so, which token should be used?
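For context, the usual fix in Hugging Face `transformers` is not to replace the token but to grow the model's embedding table after adding it, via `model.resize_token_embeddings(len(tokenizer))`. I don't know whether this repo already does that somewhere, so here is a minimal, hypothetical pure-Python sketch of what that resize does conceptually (new rows initialized to the mean of existing rows, a common heuristic for added special tokens):

```python
def resize_embeddings(embeddings, new_vocab_size):
    """Grow a (vocab_size x dim) embedding table to new_vocab_size rows.

    Sketch only: new rows are set to the mean of the existing rows, so the
    new <PH> token starts from a "neutral" embedding instead of garbage.
    """
    old_size = len(embeddings)
    dim = len(embeddings[0])
    # per-dimension mean of the existing rows
    mean_row = [sum(row[d] for row in embeddings) / old_size for d in range(dim)]
    return embeddings + [list(mean_row) for _ in range(new_vocab_size - old_size)]


# Toy table: vocab of 3 tokens, embedding dim 2.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Suppose <PH> was assigned id 3 by tokenizer._add_tokens; grow the table to cover it.
table = resize_embeddings(table, 4)
```

With a real model the single call `model.resize_token_embeddings(len(tokenizer))` after `tokenizer._add_tokens(["<PH>"], special_tokens=True)` achieves the same effect, and the new row is then trained (or left as-is if `<PH>` only serves as padding that gets masked out).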
