
Cannot train Arabic models with a custom tokenizer #13248

@gtoffoli

Description


This issue was initially about a possible bug in the training pipeline, related to the parser (see below). But I now believe it is more appropriate to start with some preliminary questions:

  • is it possible to create a completely custom tokenizer that does not define custom rules and the usual helper methods, but simply redefines the main `__call__` method?
  • in that case, where can I find documentation on how the tokenizer should use the `Vocab` API to populate the vocabulary while tokenizing?
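
Regarding the first question, the custom-tokenizer pattern in the spaCy docs suggests this is possible: a tokenizer only needs to be a callable that maps text to a `Doc`. A minimal sketch follows; the class name is hypothetical and the whitespace split is a placeholder for a real Arabic segmenter. Note that constructing the `Doc` interns each word in `vocab.strings`, so no explicit `StringStore` bookkeeping is needed for the token texts themselves.

```python
import spacy
from spacy.tokens import Doc

class CustomTokenizer:
    """A tokenizer that only defines __call__ (hypothetical sketch).

    The whitespace split below stands in for an external,
    Arabic-aware segmenter.
    """
    def __init__(self, vocab):
        self.vocab = vocab  # the pipeline's shared Vocab

    def __call__(self, text):
        words = text.split()  # placeholder for the real segmenter
        spaces = [True] * len(words)
        # Building the Doc interns every word in vocab.strings,
        # so token texts are automatically known to the StringStore.
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("ar")
nlp.tokenizer = CustomTokenizer(nlp.vocab)
doc = nlp("a simple test")
print([t.text for t in doc])
```

The key design point is that the tokenizer and the rest of the pipeline must share the same `Vocab` instance, which is why it is passed in at construction time.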

Some context information

In the discussion Arabic language support, in the comment "I'm willing to prototype a spaCy language model for Arabic (SMA)", I reported on my choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on the integration/adaptation of an alternative tokenizer whose output, according to the printout of the `debug data` command, aligns better with the tokens in the training set (after a minor modification of the training set itself).

In a subsequent comment in the same discussion, I reported on:

  1. an exception raised by a parser-related module of the spaCy training software when running the `train` command with the same data and configuration as `debug data`;
  2. the very poor results (low overall score) obtained with a reduced configuration that excludes the parser.

Below is an excerpt of the traceback for the exception (point 1). You can find the full traceback in the discussion linked above.

⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']")
Traceback (most recent call last):
  File "C:\language310\lib\site-packages\spacy\training\loop.py", line 298, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "C:\language310\lib\site-packages\spacy\language.py", line 1459, in evaluate
    for eg, doc in zip(examples, docs):
  File "C:\language310\lib\site-packages\spacy\language.py", line 1618, in pipe
    for doc in docs:
  File "C:\language310\lib\site-packages\spacy\util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy\pipeline\transition_parser.pyx", line 255, in pipe
  File "C:\language310\lib\site-packages\spacy\util.py", line 1704, in raise_error
    raise e
  File "spacy\pipeline\transition_parser.pyx", line 252, in spacy.pipeline.transition_parser.Parser.pipe
  File "spacy\pipeline\transition_parser.pyx", line 345, in spacy.pipeline.transition_parser.Parser.set_annotations
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 176, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 181, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '8206900633647566924'. This usually refers to an issue with the `Vocab` or `StringStore`."

The above exception was the direct cause of the following exception:
(omissis)
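
The final `KeyError` [E018] indicates that `deprojectivize` looked up a label hash that the `Vocab`'s `StringStore` cannot resolve back to a string, i.e. the string was never interned in that store. Purely as an illustration of that failure mode (the label text below is made up), the same class of error can be reproduced in isolation:

```python
from spacy.strings import StringStore

store = StringStore()
key = store.add("subtok||label")  # interning returns the 64-bit hash
assert store[key] == "subtok||label"  # hash -> string works after add()

fresh = StringStore()  # a store that never interned that string
try:
    fresh[key]  # same failure mode as the E018 error above
except KeyError as err:
    print(err)
```

This suggests the crash could be related to the custom tokenizer and the training pipeline not sharing (or not persisting) the same `Vocab`/`StringStore`, which ties back to the second preliminary question.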

My Environment

  • Operating System: Windows 11
  • Python Version Used: 3.10
  • spaCy Version Used: 3.7
