About PlaceHolder Token <PH> #8

@xyltt

Description

Hello,

I found that you add a placeholder token to the tokenizer using the following code:

  tokenizer._add_tokens(["<PH>"], special_tokens=True)
  tokenizer.placeholder_token = "<PH>"

And the placeholder token is then used in the following code:

    encoded_ph = tokenizer.convert_tokens_to_ids(tokenizer.placeholder_token)
    
    if len(truncated_rewrite) > len(truncated_query):
        truncated_query   += [encoded_ph] * (len(truncated_rewrite) - len(truncated_query))
    else:
        truncated_rewrite += [encoded_ph] * (len(truncated_query) - len(truncated_rewrite))

However, the index of this placeholder token exceeds the size of the pre-trained vocabulary, so there is no embedding for this token in the pre-trained embedding table. How can this be solved? Should the placeholder token be replaced with an existing token from the vocab? If so, which token should be used?
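For context, the usual fix in Hugging Face `transformers` is not to replace the token but to grow the model's embedding table after adding it, via `model.resize_token_embeddings(len(tokenizer))`. I don't know whether this repo already does that somewhere, so here is a minimal, hypothetical pure-Python sketch of what that resize does conceptually (new rows initialized to the mean of existing rows, a common heuristic for added special tokens):

```python
def resize_embeddings(embeddings, new_vocab_size):
    """Grow a (vocab_size x dim) embedding table to new_vocab_size rows.

    Sketch only: new rows are set to the mean of the existing rows, so the
    new <PH> token starts from a "neutral" embedding instead of garbage.
    """
    old_size = len(embeddings)
    dim = len(embeddings[0])
    # per-dimension mean of the existing rows
    mean_row = [sum(row[d] for row in embeddings) / old_size for d in range(dim)]
    return embeddings + [list(mean_row) for _ in range(new_vocab_size - old_size)]


# Toy table: vocab of 3 tokens, embedding dim 2.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Suppose <PH> was assigned id 3 by tokenizer._add_tokens; grow the table to cover it.
table = resize_embeddings(table, 4)
```

With a real model the single call `model.resize_token_embeddings(len(tokenizer))` after `tokenizer._add_tokens(["<PH>"], special_tokens=True)` achieves the same effect, and the new row is then trained (or left as-is if `<PH>` only serves as padding that gets masked out).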
