This issue was initially about a possible bug in the training pipeline, related to the parser (see below). However, I now believe it is more appropriate to ask some preliminary questions first:
- is it possible to create a completely custom tokenizer that does not define custom rules and a handful of methods, but simply overrides the main `__call__` method?
- if so, where can I find documentation on how the tokenizer should use the Vocab API to populate the vocabulary while tokenizing?
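For what it's worth, my current understanding (a sketch based on the spaCy docs, not on the code discussed here) is that a custom tokenizer only needs to be a callable that takes a text and returns a Doc built on the shared vocab; constructing the Doc interns each token string in `vocab.strings` automatically. A minimal whitespace-only example, with illustrative names:

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Custom tokenizer that defines no rules and only overrides __call__."""

    def __init__(self, vocab):
        # The shared Vocab; Doc() interns new strings into vocab.strings.
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # Every token is followed by a space except the last one.
        spaces = [True] * len(words)
        if words:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)


# Arabic has basic language data, so a blank "ar" pipeline works offline.
nlp = spacy.blank("ar")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

doc = nlp("hello world")
assert [t.text for t in doc] == ["hello", "world"]
```

Whether this is enough for training, or whether the tokenizer must do anything else with the Vocab, is exactly what I am asking above.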
Some context information
In the discussion Arabic language support, in the comment "I'm willing to prototype a spaCy language model for Arabic (SMA)", I reported on the choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on the integration/adaptation of an alternative tokenizer whose output, according to the printout of the debug data command, aligns better with the tokens in the training set (after a minor modification of the training set itself).
In a subsequent comment in the same discussion, I reported on:
- an exception raised by a parser-related module of the spaCy training code when running the train command with the same data and configuration as debug data;
- the very poor results (low overall score) obtained with a reduced configuration that excludes the parser.
Below is an excerpt of the traceback related to the exception (first point above). You can find the full traceback in the discussion referenced above.
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']")
Traceback (most recent call last):
File "C:\language310\lib\site-packages\spacy\training\loop.py", line 298, in evaluate
scores = nlp.evaluate(dev_corpus(nlp))
File "C:\language310\lib\site-packages\spacy\language.py", line 1459, in evaluate
for eg, doc in zip(examples, docs):
File "C:\language310\lib\site-packages\spacy\language.py", line 1618, in pipe
for doc in docs:
File "C:\language310\lib\site-packages\spacy\util.py", line 1685, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy\pipeline\transition_parser.pyx", line 255, in pipe
File "C:\language310\lib\site-packages\spacy\util.py", line 1704, in raise_error
raise e
File "spacy\pipeline\transition_parser.pyx", line 252, in spacy.pipeline.transition_parser.Parser.pipe
File "spacy\pipeline\transition_parser.pyx", line 345, in spacy.pipeline.transition_parser.Parser.set_annotations
File "spacy\pipeline\_parser_internals\nonproj.pyx", line 176, in spacy.pipeline._parser_internals.nonproj.deprojectivize
File "spacy\pipeline\_parser_internals\nonproj.pyx", line 181, in spacy.pipeline._parser_internals.nonproj.deprojectivize
File "spacy\strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '8206900633647566924'. This usually refers to an issue with the `Vocab` or `StringStore`."
The above exception was the direct cause of the following exception:
(omitted)
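To make the E018 error above concrete (this is only an illustration of what the error means, not a diagnosis of the bug): the StringStore maps strings to 64-bit hashes and back, and looking up a hash that was never interned raises exactly this KeyError.

```python
from spacy.strings import StringStore

# A StringStore maps strings to 64-bit hashes and back.
ss = StringStore(["apple"])
h = ss["apple"]          # string -> hash
assert ss[h] == "apple"  # hash -> string round-trips for known strings

# Looking up a hash that was never added raises the E018 KeyError,
# as in the traceback above (the hash value is copied from it).
missing = False
try:
    ss[8206900633647566924]
except KeyError:
    missing = True
assert missing
```

So the parser's deprojectivize step apparently asks the StringStore for a hash that was never added to it, which is why I suspect the custom tokenizer may not be feeding the Vocab correctly.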
My Environment
- Operating System: Windows 11
- Python Version Used: 3.10
- spaCy Version Used: 3.7