Hi, in the README.md it says:

> A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.

Could you provide the script used for tokenization (ideally), or describe in more detail how the tokenization is done? For example:
- Is the model `en_core_web_sm`, `en_core_web_md`, or `en_core_web_lg`?
- What custom arguments are passed to the `Tokenizer`? I assume `token_match=lambda t: t in ('[PAR]', '[TLE]', '[DOC]')`, for example.
- I believe I found a problem with the tokenization in the provided data. If you look at line 5 of `HotpotQA.jsonl`, the `[SEP]` token is split up, which in my opinion it should not be (`[PAR]`, `[TLE]`, and `[DOC]` are not split up):
```
>>> sed '5q;d' HotpotQA.jsonl | jq '.context_tokens[2:7]'
[
  [
    "Ethanol",
    12
  ],
  [
    "[",
    20
  ],
  [
    "SEP",
    21
  ],
  [
    "]",
    24
  ],
  [
    "Ethanol",
    26
  ]
]
```
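For reference, here is a minimal sketch of what I mean. This is my own guess, not the authors' confirmed script: it assumes a blank English spaCy pipeline (the actual model may be one of the `en_core_web_*` models) and a hypothetical `token_match` regex covering the bracketed special tokens. The first pipeline reproduces the splitting reported above; the second keeps `[SEP]` whole.

```python
import re

import spacy

text = "Ethanol [SEP] Ethanol"

# Default tokenizer: "[" and "]" are prefix/suffix punctuation, so the
# special token is split into three pieces, as seen in the dataset.
nlp_default = spacy.blank("en")
print([(t.text, t.idx) for t in nlp_default(text)])
# [('Ethanol', 0), ('[', 8), ('SEP', 9), (']', 12), ('Ethanol', 14)]

# Hypothetical fix: a token_match that keeps bracketed special tokens
# intact. token_match takes priority over prefix/suffix splitting.
SPECIAL = re.compile(r"\[(PAR|TLE|DOC|SEP)\]")
nlp_fixed = spacy.blank("en")
nlp_fixed.tokenizer.token_match = lambda t: SPECIAL.fullmatch(t)
print([(t.text, t.idx) for t in nlp_fixed(text)])
# [('Ethanol', 0), ('[SEP]', 8), ('Ethanol', 14)]
```

If the released data were produced with something like the second setup for `[PAR]`/`[TLE]`/`[DOC]` but with `[SEP]` missing from the pattern, that would explain the inconsistency.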