NER training on a dataset that was annotated with an older version

Zainpann · January 25, 2021, 7:34am

I annotated data on an older version of Prodigy, and now when I db-in that data and train on it with a newer version of Prodigy, it gives me the following error during training:

(spacy221) C:\Users\BNV>python -m prodigy train ner newenglish en_core_web_lg --output D:\latest\modelss
Loaded model 'en_core_web_lg'
C:\Users\BNV\Envs\spacy221\lib\site-packages\prodigy\recipes\train.py:453: UserWarning: [W030] Some entities could not be aligned in the text "chalo hamaray ilawa koi tou balance ki baat kertay..." with entities "[(0, 5, 'IGNORE'), (6, 13, 'IGNORE'), (14, 19, 'IG...". Use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the alignment. Misaligned entities ('-') will be ignored during training.
biluo = biluo_tags_from_offsets(doc, offsets, missing=missing_tag)
C:\Users\BNV\Envs\spacy221\lib\site-packages\prodigy\recipes\train.py:453: UserWarning: [W030] Some entities could not be aligned in the text "haha..."tu mera hero" bhi tha:-P;-)" with entities "[(0, 4, 'IGNORE'), (7, 10, 'IGNORE'), (11, 15, 'IG...". Use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the alignment. Misaligned entities ('-') will be ignored during training.
biluo = biluo_tags_from_offsets(doc, offsets, missing=missing_tag)
C:\Users\BNV\Envs\spacy221\lib\site-packages\prodigy\recipes\train.py:453: UserWarning: [W030] Some entities could not be aligned in the text "Arsal:-Sab kOo apni jan bachane ka haq ga.Jiya :-T..." with entities "[(0, 5, 'PERSON'), (7, 10, 'IGNORE'), (11, 14, 'IG...". Use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the alignment. Misaligned entities ('-') will be ignored during training.
biluo = biluo_tags_from_offsets(doc, offsets, missing=missing_tag)
Created and merged data for 20761 total examples
Using 16609 train / 4152 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Baseline accuracy: 0.665

=========================== Training the model ===========================

Loss Precision Recall F-Score

1: 74%|█████████████████████████████████████████████████████▏ | 12281/16609 [02:44<00:24, 177.92it/s]Traceback (most recent call last):
File "C:\Users\BNV\AppData\Local\Programs\Python\Python36\Lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\BNV\AppData\Local\Programs\Python\Python36\Lib\runpy.py", line 85, in run_code
exec(code, run_globals)
File "C:\Users\BNV\Envs\spacy221\lib\site-packages\prodigy_main.py", line 60, in
controller = recipe(args, use_plac=True)
File "cython_src\prodigy\core.pyx", line 300, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "C:\Users\BNV\Envs\spacy221\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "C:\Users\BNV\Envs\spacy221\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func((args + varargs + extraopts), **kwargs)
File "C:\Users\BNV\Envs\spacy221\lib\site-packages\prodigy\recipes\train.py", line 163, in train
nlp.update(docs, annots, drop=dropout, losses=losses)
File "C:\Users\BNV\Envs\spacy221\lib\site-packages\spacy\language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "nn_parser.pyx", line 446, in spacy.syntax.nn_parser.Parser.update
File "nn_parser.pyx", line 551, in spacy.syntax.nn_parser.Parser._init_gold_batch
File "transition_system.pyx", line 102, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
File "transition_system.pyx", line 163, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the experimental debug-data command to validate your JSON-formatted training data. For details, run:
python -m spacy debug-data --help

ines · January 26, 2021, 11:18pm

Hi! It looks like your data ended up with misaligned tokens (which old versions of spaCy quietly skipped, but which it now raises an error about explicitly). Did you use the same tokenizer during annotation and training?

One easy way to find the misaligned examples, check what's wrong and/or just exclude them from your dataset would be to load your Prodigy dataset and use spaCy's Doc.char_span method to check that all spans refer to valid tokens. If there are only a few problematic examples, you could just skip them and save the filtered examples to a new dataset.

import spacy
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("your_dataset_here")  # Prodigy dataset
nlp = spacy.blank("en")  # whichever language/model you used
for example in examples:
    doc = nlp(example["text"])
    # examples without annotations may have no "spans" key at all
    for span in example.get("spans", []):
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
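To skip the problematic examples instead of only printing them, the same check can be turned into a filter. The following is a minimal sketch, not Prodigy's own code: `token_offsets` is a stand-in whitespace tokenizer so the snippet runs without spaCy, `split_by_alignment` is a hypothetical helper name, and the sample text and span are taken from the warnings in the log above. A span counts as aligned when its start is a token start and its end is a token end, which is the condition under which Doc.char_span returns a span rather than None.

```python
import re


def token_offsets(text):
    """Stand-in tokenizer (assumption): whitespace split, returns (start, end) pairs."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def split_by_alignment(examples, tokenize=token_offsets):
    """Split examples into (aligned, misaligned) based on token boundaries."""
    aligned, misaligned = [], []
    for example in examples:
        offsets = tokenize(example["text"])
        starts = {start for start, _ in offsets}
        ends = {end for _, end in offsets}
        # every span must begin at a token start and finish at a token end
        ok = all(
            span["start"] in starts and span["end"] in ends
            for span in example.get("spans", [])
        )
        (aligned if ok else misaligned).append(example)
    return aligned, misaligned


examples = [
    # (0, 5) cuts the whitespace token "Arsal:-Sab" in half, so it is misaligned here
    {"text": "Arsal:-Sab kOo apni jan", "spans": [{"start": 0, "end": 5, "label": "PERSON"}]},
    # (11, 14) covers the token "kOo" exactly
    {"text": "Arsal:-Sab kOo apni jan", "spans": [{"start": 11, "end": 14, "label": "IGNORE"}]},
]
aligned, misaligned = split_by_alignment(examples)
print(len(aligned), "aligned,", len(misaligned), "misaligned")
```

With real data you would pass the tokenizer you actually train with, for example tokenize=lambda text: [(t.idx, t.idx + len(t)) for t in nlp.make_doc(text)], and then store the aligned examples with db.add_dataset("your_filtered_dataset") followed by db.add_examples(aligned, datasets=["your_filtered_dataset"]). The dataset name is a placeholder, and the two db calls assume Prodigy's database API as I remember it, so check them against your installed version.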
