Cannot debug Annotation Data to Train NER model.

I am training a model with around 100k annotations using the train ner recipe.
On the first iteration, I get the following error:

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse...
... For details, run:
python -m spacy debug-data --help

Based on this support inquiry, I understand I need to check that there are no entity spans with leading/trailing whitespace, so I create JSON files (not JSONL files) using db-out. Then, as suggested by the above error, I run debug-data.

$ python -m spacy debug-data ja ./train.json ./dev.json -b ./models/my_model/

This gives me the following error:

=========================== Data format validation ===========================
✘ Training data cannot be loaded: too many values to unpack (expected 2)
✘ Development data cannot be loaded: too many values to unpack (expected 2)

Below is a sample document from my annotation data.

{"meta":{"file":"/file1.json","pattern":""},"text":" シンガポール滞在中","spans":[{"start":1,"end":7,"label":"LOCATION"}]}

I believe I have correctly formatted the data to run debug-data, but is there something I have missed?

I just looked into the initial problem (ValueError: [E024]) and checked all entities for leading/trailing whitespace and punctuation characters (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~).

I confirmed that none of the entities have leading/trailing whitespace, but some of them do contain punctuation characters.
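For reference, the check looks roughly like this (just a sketch over the db-out export, with one task per line as in the sample above; the file name is a placeholder):

import json
import string

# rough sketch of the check; "annotations.jsonl" is a placeholder for the db-out export
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        task = json.loads(line)
        text = task["text"]
        for span in task.get("spans", []):
            ent = text[span["start"]:span["end"]]
            if ent != ent.strip():
                print("leading/trailing whitespace:", repr(ent))
            if any(ch in string.punctuation for ch in ent):
                print("contains punctuation:", repr(ent))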

The second condition seems a bit restrictive, since this would mean that a DISTANCE entity with value "2.5km" cannot be trained on, right?

My first guess for why prodigy train is failing like this would be that there's a mismatch between the tokenizer used during annotation and the one used during training. What tokenizer are you using for Japanese?

Use prodigy data-to-spacy to export the data directly to spaCy's JSON training format: https://prodi.gy/docs/recipes#data-to-spacy. The format expected by spacy debug-data should look like this training-data.json, with BILUO tags as the "ner" values under each token.
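Roughly, that structure looks like this (a trimmed sketch based on your sample document; the real export also includes token fields like "tag", "head" and "dep", and the token boundaries here are only illustrative):

[
  {
    "id": 0,
    "paragraphs": [
      {
        "raw": "シンガポール滞在中",
        "sentences": [
          {
            "tokens": [
              { "id": 0, "orth": "シンガポール", "ner": "U-LOCATION" },
              { "id": 1, "orth": "滞在", "ner": "O" },
              { "id": 2, "orth": "中", "ner": "O" }
            ]
          }
        ]
      }
    ]
  }
]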

I think entities with leading or trailing whitespace are invalid, but punctuation is only a problem when it's a separate token at the beginning or end of an entity (which isn't the case for 2.5km) and you've also enabled a training option that adds noise. As long as you don't enable the noise option for the training data, it should be fine. (One of the noise methods can replace punctuation with whitespace, which leads to the invalid-whitespace issue.) In any case, the debug-data output should include a warning if this is relevant for your data.

My NER data is annotated using third-party software, and I am using ginza when training, so it's possible that there is a mismatch between the tokenizer used for annotation and the one used for training.

That said, when I run prodigy train, I see the following warning appear several times. It tells me that any misaligned entities will be ignored, so I would expect Prodigy to be able to train a model without any problem, since the mismatches are dropped during training.

/lib/python3.6/site-packages/prodigy/recipes/train.py:453: UserWarning: [W030] Some entities could not be aligned in the text "11月の大相撲九州場所で初優勝し、初場所(来年1月13日初日・両国国技館)で新関脇に昇進した貴景勝が..." with entities "[(0, 3, 'TEMPORAL:DATE'), (7, 9, 'LOCATION'), (23,...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

When I run prodigy data-to-spacy with the command below, I get a similar warning, but another error also appears, asking for a component to be added. This is also the case if I replace ginza with spaCy's ja_core_news_lg.

prodigy data-to-spacy ./ner.spacy.json ./ner.spacy.eval.json -l ja -n ner_table -m ./models/ja_ginza/

...
python3.6/site-packages/prodigy/recipes/train.py:453: UserWarning: [W030] Some entities could not be aligned in the text "遊園地のアトラクション(遊戯施設)における待ち時間を高精度で再現するモデル..." with entities "[(1, 4, 'ORGANIZATION'), (147, 157, 'ORGANIZATION'...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
...

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

To be able to export the data, you can load models/ja_ginza and add a sentencizer with:

import spacy
nlp = spacy.load("models/ja_ginza")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.to_disk("/path/to/ja_ginza_sentencizer")
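Once that's saved, you should be able to point prodigy data-to-spacy at it with -m /path/to/ja_ginza_sentencizer instead of the original model directory.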

But if your annotation comes from another source and you're using ginza or ja_core_news_sm (both use SudachiPy, I think?), then this mismatch is probably the source of the "no optimal move" error. My guess is that most of your annotation is being discarded because it doesn't line up with the SudachiPy tokens.
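If you want to see how much of your data is affected, you can run the alignment check from the warning message over your exported annotations, something like this (just a sketch; the model path and file name are placeholders):

import json
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("models/ja_ginza")  # or the model you plan to train with

total = misaligned = 0
with open("annotations.jsonl", encoding="utf8") as f:  # placeholder for your db-out export
    for line in f:
        task = json.loads(line)
        spans = [(s["start"], s["end"], s["label"]) for s in task.get("spans", [])]
        tags = biluo_tags_from_offsets(nlp.make_doc(task["text"]), spans)
        total += 1
        if "-" in tags:
            misaligned += 1

print(misaligned, "of", total, "examples contain entities that don't align with the tokens")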

You'll need a spacy model with a tokenizer that produces the same tokenization as used in your annotation (or very close) to be able to train a useful model. For most of our models we try to get the tokenization accuracy to at least 99%. For Chinese it's only 95% and even this difference can have a very large impact on the accuracy of the trained components.

Once you have the data in the JSON training format, you can check the tokenization accuracy (as TOK) with:

$ spacy evaluate /path/to/model dev.json