spaCy Tokenization Details #25

@tuzhucheng

Description

Hi, in the README.md it says:

A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.

Could you provide the script used for tokenization (ideally), or describe in more detail how the tokenization was done? For example:

  • Is the model en_core_web_sm, en_core_web_md, or en_core_web_lg?
  • What custom arguments are passed to the Tokenizer? For example, I assume something like token_match=lambda t: t in ('[PAR]', '[TLE]', '[DOC]').
  • I believe I found a problem with the tokenization in the provided data. If you look at line 5 of HotpotQA.jsonl, the [SEP] token is split up, which in my opinion it should not be ([PAR], [TLE], and [DOC] are not split up):
$ sed '5q;d' HotpotQA.jsonl | jq '.context_tokens[2:7]'
[
  [
    "Ethanol",
    12
  ],
  [
    "[",
    20
  ],
  [
    "SEP",
    21
  ],
  [
    "]",
    24
  ],
  [
    "Ethanol",
    26
  ]
]
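
For reference, here is one way the markers could be kept intact in spaCy, using tokenizer special cases. This is a hypothetical sketch, not the authors' actual script (which is exactly what this issue is asking for); it uses spacy.blank("en") so no pretrained model download is needed, whereas the real pipeline may have been one of the en_core_web_* models:

```python
import spacy
from spacy.attrs import ORTH

# A blank English pipeline; the dataset's actual model is unknown (assumption).
nlp = spacy.blank("en")

# Register each marker as a special case so the tokenizer emits it as a
# single token instead of splitting off the brackets as prefix/suffix.
for marker in ("[SEP]", "[PAR]", "[TLE]", "[DOC]"):
    nlp.tokenizer.add_special_case(marker, [{ORTH: marker}])

# With the special cases in place, "[SEP]" stays whole and the character
# offsets match the (token, offset) tuple format described in the README.
doc = nlp("Ethanol [SEP] Ethanol")
tokens = [(t.text, t.idx) for t in doc]
print(tokens)
```

If the original script instead relied on token_match, the same effect could be achieved by pointing it at a regex like re.compile(r"\[(SEP|PAR|TLE|DOC)\]").match, but without the source it is unclear which mechanism (if either) was used, and whether [SEP] was simply omitted from it.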
