Prodigy 1.90 train recipe --ner-missing argument

Hi. A bit confused here. Is the new --ner-missing argument on the train ner recipe the same as the old --no-missing argument? Or is it the exact opposite?

Yes, it's the opposite: --no-missing means there are no missing values, --ner-missing means that the NER data contains missing values (and was created using a binary recipe etc.). Also see the argument description here.

The previous ner.batch-train recipe had a stronger focus on training from binary incomplete annotations. The new train recipe is a more general-purpose training recipe that assumes annotations are complete by default, which is more consistent. If unannotated tokens should be treated as unknown/missing values, you can explicitly set --ner-missing.
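To make the difference concrete, here is a minimal sketch. It is a simplified illustration and not Prodigy's or spaCy's actual code: the helper `token_tags`, its `ner_missing` parameter and the use of `-` as the marker for "unknown" are assumptions made for this example.

```python
# Simplified illustration only; not Prodigy's or spaCy's implementation.
def token_tags(tokens, annotated, ner_missing=False):
    """Return one tag per token.

    `annotated` maps token index -> label for tokens the annotator marked.
    Unannotated tokens become "O" (definitely not an entity) by default,
    or "-" (unknown / missing value) when ner_missing is True.
    """
    fill = "-" if ner_missing else "O"
    return [annotated.get(i, fill) for i in range(len(tokens))]


tokens = ["Angela", "Merkel", "besucht", "Siemens"]
annotated = {0: "PER", 1: "PER"}  # only the person was annotated

# Default: annotations are assumed complete, so "Siemens" counts as a non-entity.
print(token_tags(tokens, annotated))
# With missing values: "Siemens" is simply unknown and may still be an entity.
print(token_tags(tokens, annotated, ner_missing=True))
```

In the first case the model would be told that "Siemens" is not an entity; in the second case it is told nothing about that token.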

Hi @ines, thanks for the fast reply. I had looked at the documentation. The text from the documentation was the source of my confusion :grinning:

Your explanation here now made it perfectly clear to me.

If it's not too much to ask, could you please add a few more lines explaining exactly when the --binary argument is needed in NER training, and why training NER with this option prints the overall accuracy for the trained model rather than the per-entity recall, precision and F-score values?

Thank you very much in advance for your patience and effort.

In a typical training scenario, you're updating a model with examples and the correct answer – e.g. with a text and the entities in it. In some cases you may also have partial annotations: you know some entities but not all.

Prodigy's active learning recipes like ner.teach also let you collect binary yes/no decisions. The data you create here is different again: for some spans, you know that they are entities, because you accepted them. For the ones you rejected, you know that they're not of type X – but they could potentially be something else. This requires a different way of updating the model: you want to update with the positive examples where you know the answer, and proportionally with the "negative" example where you only know that a certain label doesn't apply. That's the type of training the --binary flag enables.
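As a rough sketch of what a binary decision tells you about a span (the helper `possible_labels` and the small label set are made up for illustration; this is not how Prodigy implements the update internally):

```python
# Hypothetical helper for illustration; not part of Prodigy.
LABELS = {"LOC", "ORG", "PER"}


def possible_labels(label, answer, labels=LABELS):
    """Return the labels still possible for a span after a yes/no decision.

    None stands for "not an entity at all".
    """
    if answer == "accept":
        # An accepted span has exactly one known answer.
        return {label}
    # A rejected span only rules out this one label.
    return (set(labels) - {label}) | {None}


print(possible_labels("ORG", "accept"))
print(possible_labels("ORG", "reject"))
```

An accept pins the answer down to a single label, while a reject leaves every other label (and "no entity") open, which is why the two kinds of example can't be used for the same kind of update.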

Fine-grained per-label accuracy is a very new feature in spaCy, so we only just added that to the regular training recipe in Prodigy in v1.9. The binary training requires very different evaluation (for the reasons explained above), so if we wanted more fine-grained accuracy, we'd have to come up with our own implementation and logic for it. It's also not clear if it translates well and makes it easier to reason about the results.

Thank you so much @ines.
Let’s see whether I understood you correctly:

I’m trying to tweak NER in the German de_core_news_sm model to my needs. I’ve collected 16,000 gold annotated sentences using Prodigy 1.8x and the „old“ ner.make-gold recipe with the following arguments:

ner.make-gold de_ner_gold de_core_news_sm sentences.jsonl 
--label labels.txt --unsegmented

labels.txt contains the following labels:

The already existing labels LOC (keeping Countries, cities and states), ORG, PER plus five new labels JOBTITLE, PRODUCT, FAC (facilities: Buildings, airports, highways, bridges, etc.), GEO (Non-GPE locations, mountain ranges, bodies of water.), EVENT.

MISC was deliberately ignored as it is considered not useful.

In Prodigy 1.9x ner.make-gold is now deprecated. So my command for collecting the gold annotations would translate to:

ner.correct de_ner_gold de_core_news_sm sentences.jsonl 
--label labels.txt --unsegmented

Correct?

The de_core_news_sm model was then trained using Prodigy 1.8x and the „old“ ner.batch-train recipe with the following arguments:

ner.batch-train de_ner_gold de_core_news_sm --output model_ner_gold 
--label labels.txt --eval-split 0.2 --n-iter 50 --batch-size 32 
--unsegmented --no-missing

The --no-missing argument was used because @honnibal recommended using it in his answer to my earlier post Is there something wrong in general with the German model? and this indeed improved accuracy quite a bit. The model trained with this argument achieved an overall accuracy of 76.5 %.

In Prodigy 1.9x ner.batch-train is now deprecated. So my command for training the de_core_news_sm model would translate to:

train ner de_ner_gold de_core_news_sm --output model_ner_gold 
--eval-split 0.2 --n-iter 50 --batch-size 32

The arguments --label, --unsegmented and --no-missing are no longer supported.

--label is no longer needed as the train ner recipe detects the labels from the annotation.
--unsegmented is not needed as the train ner recipe uses the text from the annotation as is.
--ner-missing is not needed as it is the opposite of --no-missing. --no-missing is now the default behavior in NER training as @ines explained above.

Correct?

When using the above statement, I get the following result from training:

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           86.443   85.432    85.935
PRODUCT       83.897   85.572    84.726
EVENT         79.348   66.364    72.277
FAC           58.696   50.943    54.545
LOC           92.209   88.634    90.386
PER           90.086   89.574    89.829
JOBTITLE      82.228   78.571    80.358
GEO           50.000   10.000    16.667

Best F-Score   86.177
Baseline       42.775

All together I’m very pleased with the much more informative and detailed per label output of the results. Although these results don’t allow a direct comparison with the „old“ overall accuracy value, I can see that the results from the new train ner recipe have greatly improved, as an overall F-Score of 86.177 is clearly better than an overall accuracy of 76.5 %. Isn't it?
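For reference, the F-Score column can be reproduced from the Precision and Recall columns. A small sketch, assuming the standard F1 definition (harmonic mean of precision and recall); the function name `f_score` is just for this example:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the ORG and GEO rows of the table above.
print(round(f_score(86.443, 85.432), 3))
print(round(f_score(50.0, 10.0), 3))
```

The GEO row shows how strongly a low recall pulls the F-score down, even when half of the predicted GEO entities are right.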

As the training data hasn’t changed, I can only assume that these improved results either come from the new German spaCy model supplied with spaCy v2.2 or the improved train ner recipe or a combination of the two.

Anyhow: These results additionally clearly indicate that especially the FAC and GEO labels need further training, a fact that I already had found out by testing and evaluating the model after it was trained with Prodigy 1.8x and the ner.batch-train recipe.

So in order to improve the model I had collected further examples using Prodigy 1.8x and the ner.teach recipe.

Here is what I used to annotate:

ner.teach de_ner_silver model_ner_gold de_more_sentences.jsonl 
--label labels.txt --unsegmented

labels.txt contained the same labels as above.

In Prodigy 1.9x this recipe is unchanged. Correct?

And here is what I used to train my model in Prodigy 1.8x:

ner.batch-train de_ner_silver model_ner_gold 
--output model_ner_silver 
--label labels.txt --eval-split 0.20 --n-iter 20 --batch-size 32 
--unsegmented

In Prodigy 1.9x this would translate to:

train ner de_ner_silver model_ner_gold --output model_ner_silver 
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary --ner-missing

Correct?

What I had noticed after training was that the model completely "forgot" about the MISC label.

In my case this was actually what I wanted because the MISC label wasn’t very useful for me anyway. Still I wanted to know, why that is and googling led me to an article about the „catastrophic forgetting problem“: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting.

After reading @honnibal’s article I wasn’t sure, whether this problem would only occur if a certain label isn’t contained in the training data at all or whether missing important examples for a label can lead to a partial forgetting inside this label. So I decided to split my gold examples into silver annotations and add these to the annotations that I created using the ner.teach recipe.
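Roughly, the conversion from gold to silver looks like the following sketch. The helper name `gold_to_binary` and the sample sentence are made up for illustration, and it assumes each example is a dict with `text`, `spans` (each with `start`, `end`, `label`) and `answer`, as in Prodigy's JSONL task format:

```python
def gold_to_binary(example):
    """Split one gold example into one accepted single-span example per entity."""
    return [
        {"text": example["text"], "spans": [dict(span)], "answer": "accept"}
        for span in example.get("spans", [])
    ]


# Made-up sample sentence with two gold entities.
gold = {
    "text": "Siemens baut ein Werk in Berlin.",
    "spans": [
        {"start": 0, "end": 7, "label": "ORG"},
        {"start": 25, "end": 31, "label": "LOC"},
    ],
    "answer": "accept",
}

for task in gold_to_binary(gold):
    print(task)
```

Each resulting task carries exactly one span, which is the shape of the binary decisions collected with ner.teach.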

Additionally I programmatically added fake rejects to the annotations in order to have an equal number of accepts and rejects per entity. The fake rejects were created based on the idea that a person cannot be an organization, an organization cannot be a location, a location cannot be an event, and so on. So for each accept for a span I added a reject for another label type.
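A minimal sketch of the fake-reject idea, under the same assumption about the task format (`text`, single-element `spans`, `answer`). The helper name `fake_reject` and the random choice of the "other" label are illustrative assumptions, not necessarily how the actual script picks it:

```python
import random

LABELS = ["LOC", "ORG", "PER", "JOBTITLE", "PRODUCT", "FAC", "GEO", "EVENT"]


def fake_reject(accepted, labels=LABELS, rng=random):
    """Build a rejected copy of an accepted single-span task with a different label."""
    span = accepted["spans"][0]
    other = rng.choice([label for label in labels if label != span["label"]])
    return {
        "text": accepted["text"],
        "spans": [{**span, "label": other}],
        "answer": "reject",
    }


accepted = {
    "text": "Siemens baut ein Werk in Berlin.",
    "spans": [{"start": 0, "end": 7, "label": "ORG"}],
    "answer": "accept",
}

rng = random.Random(0)  # seeded so repeated runs pick the same label
print(fake_reject(accepted, rng=rng))
```

The rejected task keeps the same text and offsets and only swaps the label, so every accept gets exactly one matching reject.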

After adding only a few more examples for the LOC and FAC labels, this already led to the interesting result that, with the Prodigy 1.8x ner.batch-train recipe, the overall accuracy climbed from 76.5 % to roughly 90 %.

So I ran another experiment:

First I trained the de_core_news_sm model with the gold annotations as described above.

Then I trained the resulting model again with the gold annotations converted to silver annotations (plus the additional fake rejects) without adding any additional training data for FAC and GEO.

With the Prodigy 1.8x ner.batch-train recipe, the overall accuracy again climbed from 76.5 % to roughly 90 %.

With Prodigy 1.9x and the train ner recipe this improvement is even more drastic:

First go with

train ner de_ner_gold de_core_news_sm --output model_ner_gold 
--eval-split 0.2 --n-iter 50 --batch-size 32

Results (as above):

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           86.443   85.432    85.935
PRODUCT       83.897   85.572    84.726
EVENT         79.348   66.364    72.277
FAC           58.696   50.943    54.545
LOC           92.209   88.634    90.386
PER           90.086   89.574    89.829
JOBTITLE      82.228   78.571    80.358
GEO           50.000   10.000    16.667


Best F-Score   86.177
Baseline       42.775

Second go with:

train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver 
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary

with the output:

Correct     6924 
Incorrect   406  
Baseline    0.944             
Accuracy    0.945
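As a sanity check on this output, the accuracy figure matches the share of correct decisions. A small sketch, assuming accuracy here simply means correct / (correct + incorrect); the function name is only for this example:

```python
def binary_accuracy(correct, incorrect):
    """Share of correct decisions among all evaluated decisions."""
    total = correct + incorrect
    return correct / total if total else 0.0


# Counts taken from the --binary training output above.
print(round(binary_accuracy(6924, 406), 3))
```

This is one number over all accept/reject decisions, so it is not directly comparable to the per-label precision, recall and F-score values.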

Alternate second go with (omitted --binary argument):

train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver 
--eval-split 0.2 --n-iter 20 --batch-size 32

with the output:

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
LOC           98.462   97.907    98.183
PER           98.207   97.624    97.915
JOBTITLE      96.451   95.483    95.965
ORG           96.378   96.456    96.417
PRODUCT       95.085   95.942    95.512
EVENT         95.276   91.667    93.436
FAC           91.228   81.250    85.950
GEO          100.000   50.000    66.667


Best F-Score   96.668
Baseline       96.765       

Not sure how to interpret these results. Can training a model again with the same data really lead to such an improvement?

Final question for now:

When googling for „LOSS in NLP“ I found the following:

A loss function is going to serve as a measurement of how far our current set of predictions are from the corresponding true values. Some examples of loss functions that are commonly used in machine learning include: Mean-Squared-Error.

Only asking because @honnibal stated somewhere (I think it was on this forum) that this value should decrease during training, aiming for zero without ever reaching zero (IIRC).

Here is what I get when training with Prodigy 1.95 and train ner. The loss values with Prodigy 1.8x were equally high. Is that expected?

#    Loss       Precision   Recall     F-Score 
--   --------   ---------   --------   --------
 1   91049.41      75.148     64.773     69.576
 2   84954.84      77.621     70.256     73.755
 3   82770.17      79.641     74.460     76.964
 4   81367.21      80.886     76.761     78.770
 5   80145.44      81.918     78.253     80.044

45   71295.62      86.974     85.355     86.157
46   71881.20      86.804     85.312     86.052
47   71647.44      86.876     85.284     86.073
48   70608.79      86.934     85.341     86.130
49   71308.20      86.965     85.384     86.167
50   70625.16      86.956     85.412     86.177
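Independent of the absolute scale, the trend can be read off the log itself. A small sketch using only the loss values shown above (the variable names are made up for this example, and it makes no assumption about how the loss is aggregated):

```python
# Loss per iteration, copied from the rows shown in the log above.
losses = {
    1: 91049.41, 2: 84954.84, 3: 82770.17, 4: 81367.21, 5: 80145.44,
    45: 71295.62, 46: 71881.20, 47: 71647.44, 48: 70608.79, 49: 71308.20,
    50: 70625.16,
}

# Relative decrease between the first and the last iteration.
relative_drop = (losses[1] - losses[50]) / losses[1]
# Iteration with the lowest loss among the rows shown.
best_iteration = min(losses, key=losses.get)

print(f"relative drop: {relative_drop:.1%}")
print(f"lowest loss among the shown rows: iteration {best_iteration}")
```

So the loss does go down, but it fluctuates between iterations towards the end instead of falling steadily.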

Sorry for the lengthy post and the many questions.

Kind regards, kamiwa