
Cannot train Arabic models with a custom tokenizer #13248

@gtoffoli

Description


This issue was initially about a possible bug in the training pipeline, related to the parser (see below). But I now believe it is more appropriate to start with some preliminary questions:

  • is it possible to create a completely custom tokenizer that does not define custom rules and the usual helper methods, but simply redefines the main `__call__` method?
  • in that case, where can I find documentation on how the tokenizer should use the `Vocab` API to populate the vocabulary while tokenizing?
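
Regarding the first question, the custom-tokenizer pattern in the spaCy docs suggests this is possible: a tokenizer only needs to be a callable that maps text to a `Doc`. A minimal sketch follows; the class name is hypothetical and the whitespace split is a placeholder for a real Arabic segmenter. Note that constructing the `Doc` interns each word in `vocab.strings`, so no explicit `StringStore` bookkeeping is needed for the token texts themselves.

```python
import spacy
from spacy.tokens import Doc

class CustomTokenizer:
    """A tokenizer that only defines __call__ (hypothetical sketch).

    The whitespace split below stands in for an external,
    Arabic-aware segmenter.
    """
    def __init__(self, vocab):
        self.vocab = vocab  # the pipeline's shared Vocab

    def __call__(self, text):
        words = text.split()  # placeholder for the real segmenter
        spaces = [True] * len(words)
        # Building the Doc interns every word in vocab.strings,
        # so token texts are automatically known to the StringStore.
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("ar")
nlp.tokenizer = CustomTokenizer(nlp.vocab)
doc = nlp("a simple test")
print([t.text for t in doc])
```

The key design point is that the tokenizer and the rest of the pipeline must share the same `Vocab` instance, which is why it is passed in at construction time.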

Some context information

In the discussion Arabic language support, in the comment "I'm willing to prototype a spaCy language model for Arabic (SMA)", I reported on my choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on the integration/adaptation of an alternative tokenizer whose output, according to the printout of the `debug data` command, aligns better with the tokens in the training set (after a minor modification of the training set itself).

In a subsequent comment in the same discussion, I reported on:

  1. an exception raised by a parser-related module of the spaCy training software when running the `train` command with the same data and configuration as `debug data`;
  2. the very poor results (low overall score) obtained with a reduced configuration that excludes the parser.

Below is an excerpt of the traceback for the exception (point 1). You can find the full traceback in the discussion linked above.

⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']")
Traceback (most recent call last):
  File "C:\language310\lib\site-packages\spacy\training\loop.py", line 298, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "C:\language310\lib\site-packages\spacy\language.py", line 1459, in evaluate
    for eg, doc in zip(examples, docs):
  File "C:\language310\lib\site-packages\spacy\language.py", line 1618, in pipe
    for doc in docs:
  File "C:\language310\lib\site-packages\spacy\util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy\pipeline\transition_parser.pyx", line 255, in pipe
  File "C:\language310\lib\site-packages\spacy\util.py", line 1704, in raise_error
    raise e
  File "spacy\pipeline\transition_parser.pyx", line 252, in spacy.pipeline.transition_parser.Parser.pipe
  File "spacy\pipeline\transition_parser.pyx", line 345, in spacy.pipeline.transition_parser.Parser.set_annotations
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 176, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 181, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '8206900633647566924'. This usually refers to an issue with the `Vocab` or `StringStore`."

The above exception was the direct cause of the following exception:
(omissis)
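
The final `KeyError` [E018] indicates that `deprojectivize` looked up a label hash that the `Vocab`'s `StringStore` cannot resolve back to a string, i.e. the string was never interned in that store. Purely as an illustration of that failure mode (the label text below is made up), the same class of error can be reproduced in isolation:

```python
from spacy.strings import StringStore

store = StringStore()
key = store.add("subtok||label")  # interning returns the 64-bit hash
assert store[key] == "subtok||label"  # hash -> string works after add()

fresh = StringStore()  # a store that never interned that string
try:
    fresh[key]  # same failure mode as the E018 error above
except KeyError as err:
    print(err)
```

This suggests the crash could be related to the custom tokenizer and the training pipeline not sharing (or not persisting) the same `Vocab`/`StringStore`, which ties back to the second preliminary question.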

My Environment

  • Operating System: Windows 11
  • Python Version Used: 3.10
  • spaCy Version Used: 3.7
