spaCy Tokenization Details #25

@tuzhucheng

Description

Hi, in the README.md it says:

A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.

Could you provide the script used for tokenization (ideally), or describe in more detail how the tokenization was done? For example:

  • Is the model en_core_web_sm, en_core_web_md, or en_core_web_lg?
  • What custom arguments are passed to the Tokenizer? For example, I assume something like token_match=lambda t: t in ('[PAR]', '[TLE]', '[DOC]').
  • I believe I found a problem with the tokenization in the provided data. If you look at line 5 of HotpotQA.jsonl, the [SEP] token is split up, which in my opinion it should not be ([PAR], [TLE], and [DOC] are not split up):
$ sed '5q;d' HotpotQA.jsonl | jq '.context_tokens[2:7]'
[
  [
    "Ethanol",
    12
  ],
  [
    "[",
    20
  ],
  [
    "SEP",
    21
  ],
  [
    "]",
    24
  ],
  [
    "Ethanol",
    26
  ]
]
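
For reference, here is one way the markers could be kept intact in spaCy, using tokenizer special cases. This is a hypothetical sketch, not the authors' actual script (which is exactly what this issue is asking for); it uses spacy.blank("en") so no pretrained model download is needed, whereas the real pipeline may have been one of the en_core_web_* models:

```python
import spacy
from spacy.attrs import ORTH

# A blank English pipeline; the dataset's actual model is unknown (assumption).
nlp = spacy.blank("en")

# Register each marker as a special case so the tokenizer emits it as a
# single token instead of splitting off the brackets as prefix/suffix.
for marker in ("[SEP]", "[PAR]", "[TLE]", "[DOC]"):
    nlp.tokenizer.add_special_case(marker, [{ORTH: marker}])

# With the special cases in place, "[SEP]" stays whole and the character
# offsets match the (token, offset) tuple format described in the README.
doc = nlp("Ethanol [SEP] Ethanol")
tokens = [(t.text, t.idx) for t in doc]
print(tokens)
```

If the original script instead relied on token_match, the same effect could be achieved by pointing it at a regex like re.compile(r"\[(SEP|PAR|TLE|DOC)\]").match, but without the source it is unclear which mechanism (if either) was used, and whether [SEP] was simply omitted from it.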
