Prodigy 1.90 train recipe --ner-missing argument

Hi. A bit confused here. Is the new --ner-missing argument on the train ner recipe the same as the old --no-missing argument, or is it the exact opposite?

Yes, it's the opposite: --no-missing means there are no missing values, --ner-missing means that the NER data contains missing values (and was created using a binary recipe etc.). Also see the argument description here.

The previous ner.batch-train recipe had a stronger focus on training from binary incomplete annotations. The new train recipe is a more general-purpose training recipe that assumes annotations are complete by default, which is more consistent. If unannotated tokens should be treated as unknown/missing values, you can explicitly set --ner-missing.
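If it helps to picture the difference: this roughly corresponds to how NER supervision is represented in spaCy 2.x, where an unannotated token is either "O" (known to be outside any entity) or "-" (unknown/missing). A small sketch just for illustration (not Prodigy's internal code):

import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("de")
doc = nlp("Angela Merkel besucht Berlin")

# Complete annotations (the default): unannotated tokens are "O",
# i.e. known to be outside any entity.
complete = GoldParse(doc, entities=["B-PER", "L-PER", "O", "U-LOC"])

# With missing values (--ner-missing), unannotated tokens become "-",
# so the model is not penalized for predicting an entity there.
missing = GoldParse(doc, entities=["B-PER", "L-PER", "-", "U-LOC"])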


Hi @ines, thanks for the fast reply. I had looked at the documentation. The text from the documentation was the source of my confusion :grinning:

Your explanation here now made it perfectly clear to me.

If it's not asking too much, could you please drop a few more lines explaining when exactly the --binary argument is needed in NER training, and why training NER with this option prints only the overall accuracy for the trained model instead of the per-entity precision, recall and F-score values?

Thank you very much in advance for your patience and effort.

In a typical training scenario, you're updating a model with examples and the correct answer – e.g. with a text and the entities in it. In some cases you may also have partial annotations: you know some entities but not all.

Prodigy's active learning recipes like ner.teach also let you collect binary yes/no decisions. The data you create here is different again: for some spans, you know that they are entities, because you accepted them. For the ones you rejected, you know that they're not of type X – but they could potentially be something else. This requires a different way of updating the model: you want to update with the positive examples where you know the answer, and proportionally with the "negative" examples where you only know that a certain label doesn't apply. That's the type of training the --binary flag enables.
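To make that concrete, here is a rough sketch of what such binary annotations look like as task dicts (text and offsets are made up for illustration; real Prodigy tasks also carry tokens, meta and hash fields):

# An accepted suggestion: we know this span IS a LOC.
accepted = {
    "text": "She moved to Berlin last year.",
    "spans": [{"start": 13, "end": 19, "label": "LOC"}],
    "answer": "accept",
}

# A rejected suggestion: we only know this span is NOT a PER –
# it could still be an entity of some other type.
rejected = {
    "text": "She moved to Berlin last year.",
    "spans": [{"start": 13, "end": 19, "label": "PER"}],
    "answer": "reject",
}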

Fine-grained per-label accuracy is a very new feature in spaCy, so we only just added it to the regular training recipe in Prodigy v1.9. The binary training requires very different evaluation (for the reasons explained above), so if we wanted more fine-grained accuracy there, we'd have to come up with our own implementation and logic for it. It's also not clear whether it translates well and makes it easier to reason about the results.

Thank you so much @ines.
Let’s see whether I understood you correctly:

I’m trying to tweak NER in the German de_core_news_sm model to my needs. I’ve collected 16,000 gold annotated sentences using Prodigy 1.8x and the „old“ ner.make-gold recipe with the following arguments:

ner.make-gold de_ner_gold de_core_news_sm sentences.jsonl \
--label labels.txt --unsegmented

labels.txt contains the following labels:

The already existing labels LOC (covering countries, cities and states), ORG and PER, plus five new labels: JOBTITLE, PRODUCT, FAC (facilities: buildings, airports, highways, bridges, etc.), GEO (non-GPE locations: mountain ranges, bodies of water) and EVENT.

MISC was deliberately ignored as it was not considered useful.

In Prodigy 1.9x ner.make-gold is now deprecated. So my command for collecting the gold annotations would translate to:

ner.correct de_ner_gold de_core_news_sm sentences.jsonl \
--label labels.txt --unsegmented

Correct?

The de_core_news_sm model was then trained using Prodigy 1.8x and the „old“ ner.batch-train recipe with the following arguments:

ner.batch-train de_ner_gold de_core_news_sm --output model_ner_gold \
--label labels.txt --eval-split 0.2 --n-iter 50 --batch-size 32 \
--unsegmented --no-missing

The --no-missing argument was used because @honnibal recommended it in his answer to my earlier post Is there something wrong in general with the German model?, and this indeed improved accuracy quite a bit. The model trained with this argument achieved an overall accuracy of 76.5 %.

In Prodigy 1.9x ner.batch-train is now deprecated. So my command for training the de_core_news_sm model would translate to:

train ner de_ner_gold de_core_news_sm --output model_ner_gold \
--eval-split 0.2 --n-iter 50 --batch-size 32

The arguments --label, --unsegmented and --no-missing are no longer supported.

--label is no longer needed, as the train ner recipe detects the labels from the annotations.
--unsegmented is no longer needed, as the train ner recipe uses the text from the annotations as is.
--ner-missing is not needed, as it is the opposite of --no-missing; treating annotations as complete (the old --no-missing behavior) is now the default in NER training, as @ines explained above.

Correct?

When using the above command, I get the following results from training:

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           86.443   85.432    85.935
PRODUCT       83.897   85.572    84.726
EVENT         79.348   66.364    72.277
FAC           58.696   50.943    54.545
LOC           92.209   88.634    90.386
PER           90.086   89.574    89.829
JOBTITLE      82.228   78.571    80.358
GEO           50.000   10.000    16.667

Best F-Score   86.177
Baseline       42.775

Altogether I’m very pleased with the much more informative and detailed per-label output of the results. Although these results don’t allow a direct comparison with the „old“ overall accuracy value, I can see that the results from the new train ner recipe have greatly improved, as an overall F-score of 86.177 is clearly better than an overall accuracy of 76.5 %. Isn't it?

As the training data hasn’t changed, I can only assume that these improved results come either from the new German spaCy model shipped with spaCy v2.2, from the improved train ner recipe, or from a combination of the two.

Anyhow: these results also clearly indicate that especially the FAC and GEO labels need further training, a fact I had already found out by testing and evaluating the model after it was trained with Prodigy 1.8x and the ner.batch-train recipe.

So in order to improve the model I had collected further examples using Prodigy 1.8x and the ner.teach recipe.

Here is what I used to annotate:

ner.teach de_ner_silver model_ner_gold de_more_sentences.jsonl \
--label labels.txt --unsegmented

labels.txt contained the same labels as above.

In Prodigy 1.9x this recipe is unchanged. Correct?

And here is what I used to train my model in Prodigy 1.8x:

ner.batch-train de_ner_silver model_ner_gold \
--output model_ner_silver \
--label labels.txt --eval-split 0.20 --n-iter 20 --batch-size 32 \
--unsegmented

In Prodigy 1.9x this would translate to:

train ner de_ner_silver model_ner_gold --output model_ner_silver \
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary --ner-missing

Correct?

What I had noticed after training was that the model completely "forgot" about the MISC label.

In my case this was actually what I wanted, because the MISC label wasn’t very useful for me anyway. Still, I wanted to know why that happens, and googling led me to an article about the „catastrophic forgetting problem“: Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP · Explosion.

After reading @honnibal’s article I wasn’t sure whether this problem only occurs if a certain label isn’t contained in the training data at all, or whether missing important examples for a label can also lead to partial forgetting within that label. So I decided to split my gold examples into silver annotations and add these to the annotations I had created using the ner.teach recipe.

Additionally, I programmatically added fake rejects to the annotations in order to have an equal amount of accepts and rejects per entity. The fake rejects were created based on the idea that a person cannot be an organization, an organization cannot be a location, a location cannot be an event, and so on. So for each accepted span I added a reject with another label type.

After adding only a few more examples for the LOC and FAC labels, this already led to the interesting result that with the Prodigy 1.8x ner.batch-train recipe the overall accuracy climbed from 76.5 % to roughly 90 %.

So I ran another experiment:

First I trained the de_core_news_sm model with the gold annotations as described above.

Then I trained the resulting model again with the gold annotations converted to silver annotations (plus the additional fake rejects) without adding any additional training data for FAC and GEO.

With the Prodigy 1.8x ner.batch-train recipe the overall accuracy again climbed from 76.5 % to roughly 90 %.

With Prodigy 1.9x and the train ner recipe this improvement is even more drastic:

First go with

train ner de_ner_gold de_core_news_sm --output model_ner_gold \
--eval-split 0.2 --n-iter 50 --batch-size 32

Results (as above):

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           86.443   85.432    85.935
PRODUCT       83.897   85.572    84.726
EVENT         79.348   66.364    72.277
FAC           58.696   50.943    54.545
LOC           92.209   88.634    90.386
PER           90.086   89.574    89.829
JOBTITLE      82.228   78.571    80.358
GEO           50.000   10.000    16.667


Best F-Score   86.177
Baseline       42.775

Second go with:

train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver \
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary

with the output:

Correct     6924 
Incorrect   406  
Baseline    0.944             
Accuracy    0.945

Alternative second go (with the --binary argument omitted):

train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver \
--eval-split 0.2 --n-iter 20 --batch-size 32

with the output:

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
LOC           98.462   97.907    98.183
PER           98.207   97.624    97.915
JOBTITLE      96.451   95.483    95.965
ORG           96.378   96.456    96.417
PRODUCT       95.085   95.942    95.512
EVENT         95.276   91.667    93.436
FAC           91.228   81.250    85.950
GEO          100.000   50.000    66.667


Best F-Score   96.668
Baseline       96.765       

Not sure how to interpret these results. Can training a model again with the same data really lead to such an improvement?

Final question for now:

When googling for „LOSS in NLP“ I found the following:

A loss function is going to serve as a measurement of how far our current set of predictions are from the corresponding true values. Some examples of loss functions that are commonly used in machine learning include: Mean-Squared-Error.

I'm only asking because @honnibal stated somewhere (I think it was on this forum) that this value should decrease during training, aiming for zero without ever reaching it (IIRC).

Here is what I get when training with Prodigy 1.9.5 and train ner. The loss values with Prodigy 1.8x were equally high. Is that expected?

#    Loss       Precision   Recall     F-Score 
--   --------   ---------   --------   --------
 1   91049.41      75.148     64.773     69.576
 2   84954.84      77.621     70.256     73.755
 3   82770.17      79.641     74.460     76.964
 4   81367.21      80.886     76.761     78.770
 5   80145.44      81.918     78.253     80.044
…
45   71295.62      86.974     85.355     86.157
46   71881.20      86.804     85.312     86.052
47   71647.44      86.876     85.284     86.073
48   70608.79      86.934     85.341     86.130
49   71308.20      86.965     85.384     86.167
50   70625.16      86.956     85.412     86.177

Sorry for the lengthy post and the many questions.

Kind regards, kamiwa

Hi Kamiwa,

All of your commands look correct, and I'm glad the v1.9 changes are working well for you.

I'm suspicious about your final experiments though: the accuracy improvements don't look right to me, but I don't immediately see what's going wrong. Are you sure the evaluation isn't changing also? If you just look at the model outputs, does the improvement seem as obvious as the figures suggest? Running the A/B evaluation on a completely fresh set of text should be informative.

Regarding the loss figures, those are aggregate numbers rather than per-batch numbers, which is why they look high. I wouldn't worry about it; the only interesting thing is whether they decrease.
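Just to illustrate what "aggregate" means here, with made-up numbers (this is not our actual training loop): a loss that is summed over all batches of an epoch grows with the amount of data, so its absolute size can be large even when each batch's loss is small.

# Generic illustration: summed vs. averaged epoch loss.
per_batch_losses = [2.3, 2.1, 1.9, 1.8]                   # made-up per-batch values

epoch_loss_sum = sum(per_batch_losses)                    # 8.1 – grows with the number of batches
epoch_loss_avg = epoch_loss_sum / len(per_batch_losses)   # 2.025 – comparable across dataset sizes

# With many batches per epoch, the summed figure can look very large even
# though training is going fine – only the downward trend matters.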

@honnibal: Thank you for the reply, and sorry for taking so long to get back to you. I'm currently busy with another project. I will have to retrain my models anyway, and will then do some more testing and post the results here.

Hi Honnibal,

I finally found time to do some more testing and I think I found an explanation for the apparent model improvements shown in my previous post.

Following the recommendations from your reply to my post „Is there something wrong in general with the German model?“ I am using gold standard annotations during the first batch training round.

During the second round, I am using silver standard annotations.

These silver standard annotations were created by splitting the original gold annotation sentences into one annotation per annotated span.

Example:

The gold standard annotation:

Accept: Donald Trump (PER) was inaugurated as the 45th president (JOBTITLE) of the United States (LOC), succeeding Barack Obama (PER).

is split into four silver standard annotations:

Accept: Donald Trump (PER) was inaugurated as the 45th president of the United States, succeeding Barack Obama.  
Accept: Donald Trump was inaugurated as the 45th president (JOBTITLE) of the United States, succeeding Barack Obama.  
Accept: Donald Trump was inaugurated as the 45th president of the United States (LOC), succeeding Barack Obama.  
Accept: Donald Trump was inaugurated as the 45th president of the United States, succeeding Barack Obama (PER).  

As this method would lead to a dataset that only contains „accept“ annotations, I then added four artificial „reject“ annotations:

Reject: Donald Trump (ORG) was inaugurated as the 45th president of the United States, succeeding Barack Obama.  
Reject: Donald Trump was inaugurated as the 45th president (LOC) of the United States, succeeding Barack Obama.  
Reject: Donald Trump was inaugurated as the 45th president of the United States (JOBTITLE), succeeding Barack Obama.  
Reject: Donald Trump was inaugurated as the 45th president of the United States, succeeding Barack Obama (ORG).  
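In code, the splitting looks roughly like this (a simplified sketch of the idea rather than my exact script; the wrong-label mapping for the fake rejects is only illustrative):

import copy

# Illustrative "wrong" label per accepted label – the idea is just that
# e.g. a PER span is definitely not an ORG, an ORG is not a LOC, etc.
WRONG_LABEL = {
    "PER": "ORG", "ORG": "LOC", "LOC": "EVENT", "EVENT": "LOC",
    "JOBTITLE": "LOC", "PRODUCT": "ORG", "FAC": "PER", "GEO": "PER",
}

def gold_to_silver(example):
    """Split one gold example into one accept per span plus one fake reject."""
    silver = []
    for span in example.get("spans", []):
        accept = copy.deepcopy(example)
        accept["spans"] = [span]
        accept["answer"] = "accept"
        silver.append(accept)

        reject = copy.deepcopy(accept)
        reject["spans"][0]["label"] = WRONG_LABEL[span["label"]]
        reject["answer"] = "reject"
        silver.append(reject)
    return silver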

So when re-training the model that resulted from the first batch training with these silver annotations, the batch training process has the additional information from the reject annotations.

This additional information apparently leads to an improvement of the model.

As recommended in your previous reply, I have now compared the two resulting models using the ner.eval-ab recipe. From what I can see, the second model (B) is performing slightly better than the first (A). But the results are still far away from the accuracy and precision that the figures from the second batch training suggest.

So in order to find out what is going on, I did another experiment:

In the first round I again trained the shipped de_core_news_sm model with the gold standard annotations using the following recipe:

pgy train ner de_ner_gold de_core_news_sm --output de_ner_gold --eval-split 0.2 --n-iter 50 --batch-size 32

Here the results:

✔ Loaded model 'de_core_news_sm'
Created and merged data for 15641 total examples
Using 12513 train / 3128 eval (split 20%)
Component: ner | Batch size: 32 | Dropout: 0.2 | Iterations: 50
ℹ Baseline accuracy: 44.479

……

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           88.262   86.272    87.256
PRODUCT       83.364   87.087    85.185
LOC           90.835   89.615    90.221
JOBTITLE      83.400   81.332    82.353
PER           89.546   91.458    90.492
FAC           52.941   36.486    43.200
EVENT         81.443   68.103    74.178
GEO           50.000   15.385    23.529


Best F-Score   86.845
Baseline       44.479      

I then took the resulting model and this time trained it again with the exact same dataset and the exact same command as in the first round:

pgy train ner de_ner_gold de_ner_gold --output de_ner_gold_round_2 --eval-split 0.2 --n-iter 50 --batch-size 32

Results:

✔ Loaded model 'de_ner_gold'
Created and merged data for 15641 total examples
Using 12513 train / 3128 eval (split 20%)
Component: ner | Batch size: 32 | Dropout: 0.2 | Iterations: 50
ℹ Baseline accuracy: 96.809

Interestingly, the recipe claims that the baseline accuracy is now 96.809, although none of the results from the first round (precision, recall and F-score per entity, and best F-score) are anywhere near this value.

Again the results from this second round suggest a dramatic improvement:

Label      Precision   Recall   F-Score
--------   ---------   ------   -------
ORG           97.120   96.845    96.982
PER           98.194   97.585    97.889
JOBTITLE      95.409   96.974    96.185
LOC           97.544   97.288    97.416
PRODUCT       95.200   95.796    95.497
EVENT         93.077   93.798    93.436
FAC           87.692   87.692    87.692
GEO           90.000   69.231    78.261


Best F-Score   96.735
Baseline       96.809             

When comparing the two resulting models with the ner.eval-ab recipe I see a slight improvement in model B.
But again, these improvements are definitely nowhere near a 96.735 F-score.