Hi, a bit confused here: is the new --ner-missing argument on the train ner recipe the same as the old --no-missing argument, or is it the exact opposite?
Yes, it's the opposite: --no-missing means there are no missing values, --ner-missing means that the NER data contains missing values (and was created using a binary recipe etc.). Also see the argument description here.
The previous ner.batch-train recipe had a stronger focus on training from binary, incomplete annotations. The new train recipe is a more general-purpose training recipe that assumes annotations are complete by default, which is more consistent. If unannotated tokens should be treated as unknown/missing values, you can explicitly set --ner-missing.
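For example, if your dataset was created with a binary workflow like ner.teach, the command would look roughly like this (dataset and output names here are just placeholders):
train ner your_binary_dataset de_core_news_sm --output your_model --ner-missing
For complete, gold-standard annotations you'd simply leave the flag off and rely on the default.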
Hi @ines, thanks for the fast reply. I had looked at the documentation; the text from the documentation was actually the source of my confusion. Your explanation here has now made it perfectly clear to me.
If it is not asking too much, could you please drop a few more lines to explain when exactly the --binary argument is needed in NER training, and could you also explain why training NER with this option prints the overall accuracy for the trained model and not the per-entity recall, precision and F-score values?
Thank you very much in advance for your patience and effort.
In a typical training scenario, you're updating a model with examples and the correct answer – e.g. with a text and the entities in it. In some cases you may also have partial annotations: you know some entities but not all.
Prodigy's active learning recipes like ner.teach also let you collect binary yes/no decisions. The data you create here is different again: for some spans, you know that they are entities, because you accepted them. For the ones you rejected, you know that they're not of type X – but they could potentially be something else. This requires a different way of updating the model: you want to update with the positive examples where you know the answer, and proportionally with the "negative" examples where you only know that a certain label doesn't apply. That's the type of training the --binary flag enables.
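To make that concrete, binary tasks from ner.teach roughly look like this (the text and offsets here are made up):
{"text": "Anna Schmidt works in Berlin.", "spans": [{"start": 0, "end": 12, "label": "PER"}], "answer": "accept"}
{"text": "Anna Schmidt works in Berlin.", "spans": [{"start": 0, "end": 12, "label": "ORG"}], "answer": "reject"}
The accept tells the model that the span is a PER, while the reject only tells it that the span is not an ORG. It says nothing about what else those tokens might be.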
Fine-grained per-label accuracy is a very new feature in spaCy, so we only just added that to the regular training recipe in Prodigy v1.9. The binary training requires very different evaluation (for the reasons explained above), so if we wanted more fine-grained accuracy there, we'd have to come up with our own implementation and logic for it. It's also not clear whether it translates well and makes it easier to reason about the results.
Thank you so much @ines.
Let’s see whether I understood you correctly:
I’m trying to tweak NER in the German de_core_news_sm model to my needs. I’ve collected 16,000 gold annotated sentences using Prodigy 1.8x and the „old“ ner.make-gold recipe with the following arguments:
ner.make-gold de_ner_gold de_core_news_sm sentences.jsonl --label labels.txt --unsegmented
labels.txt contains the following labels: the already existing labels LOC (keeping countries, cities and states), ORG and PER, plus five new labels JOBTITLE, PRODUCT, FAC (facilities: buildings, airports, highways, bridges, etc.), GEO (non-GPE locations, mountain ranges, bodies of water) and EVENT.
MISC was deliberately ignored as it is considered not useful.
In Prodigy 1.9x ner.make-gold is now deprecated, so my command for collecting the gold annotations would translate to:
ner.correct de_ner_gold de_core_news_sm sentences.jsonl --label labels.txt --unsegmented
Correct?
The de_core_news_sm model was then trained using Prodigy 1.8x and the „old“ ner.batch-train recipe with the following arguments:
ner.batch-train de_ner_gold de_core_news_sm --output model_ner_gold --label labels.txt --eval-split 0.2 --n-iter 50 --batch-size 32 --unsegmented --no-missing
The --no-missing argument was used because @honnibal recommended it in his answer to my earlier post „Is there something wrong in general with the German model?“, and this indeed improved accuracy quite a bit. The model trained with this argument achieved an overall accuracy of 76.5 %.
In Prodigy 1.9x ner.batch-train is now deprecated, so my command for training the de_core_news_sm model would translate to:
train ner de_ner_gold de_core_news_sm --output model_ner_gold --eval-split 0.2 --n-iter 50 --batch-size 32
The arguments --label, --unsegmented and --no-missing are no longer supported:
--label is no longer needed, as the train ner recipe detects the labels from the annotations.
--unsegmented is not needed, as the train ner recipe uses the text from the annotations as is.
--ner-missing is not needed, as it is the opposite of --no-missing, and the --no-missing behaviour is now the default in NER training, as @ines explained above.
Correct?
When using the above command, I get the following result from training:
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 86.443 85.432 85.935
PRODUCT 83.897 85.572 84.726
EVENT 79.348 66.364 72.277
FAC 58.696 50.943 54.545
LOC 92.209 88.634 90.386
PER 90.086 89.574 89.829
JOBTITLE 82.228 78.571 80.358
GEO 50.000 10.000 16.667
Best F-Score 86.177
Baseline 42.775
Altogether I’m very pleased with the much more informative and detailed per-label output of the results. Although these results don’t allow a direct comparison with the „old“ overall accuracy value, I can see that the results from the new train ner recipe have greatly improved, as an overall F-Score of 86.177 is clearly better than an overall accuracy of 76.5 %. Isn’t it?
As the training data hasn’t changed, I can only assume that these improved results either come from the new German spaCy model supplied with spaCy v2.2, or from the improved train ner recipe, or a combination of the two.
Anyhow: These results additionally clearly indicate that especially the FAC and GEO labels need further training, a fact that I had already found out by testing and evaluating the model after it was trained with Prodigy 1.8x and the ner.batch-train recipe.
So in order to improve the model I had collected further examples using Prodigy 1.8x and the ner.teach recipe.
Here is what I used to annotate:
ner.teach de_ner_silver model_ner_gold de_more_sentences.jsonl --label labels.txt --unsegmented
labels.txt contained the same labels as above.
In Prodigy 1.9x this recipe is unchanged. Correct?
And here is what I used to train my model in Prodigy 1.8x:
ner.batch-train de_ner_silver model_ner_gold --output model_ner_silver --label labels.txt --eval-split 0.20 --n-iter 20 --batch-size 32 --unsegmented
In Prodigy 1.9x this would translate to:
train ner de_ner_silver model_ner_gold --output model_ner_silver --eval-split 0.2 --n-iter 20 --batch-size 32 --binary --ner-missing
Correct?
What I had noticed after training was that the model completely "forgot" about the MISC label.
In my case this was actually what I wanted, because the MISC label wasn’t very useful for me anyway. Still, I wanted to know why that is, and googling led me to an article about the „catastrophic forgetting problem“: Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP · Explosion.
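If I understood the article correctly, the basic idea of pseudo-rehearsal is to let the original model label some extra text and mix those predictions back into the training data as „revision“ examples. A rough sketch of what that could look like (my own simplification, not code from the article):
import spacy

nlp = spacy.load("de_core_news_sm")
extra_texts = ["Angela Merkel besuchte Berlin.", "Siemens eröffnet ein neues Werk."]

revision_examples = []
for doc in nlp.pipe(extra_texts):
    # keep whatever the original model predicts, so the updated model gets to "rehearse" it
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
             for ent in doc.ents]
    revision_examples.append({"text": doc.text, "spans": spans, "answer": "accept"})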
After reading @honnibal’s article I wasn’t sure whether this problem only occurs if a certain label isn’t contained in the training data at all, or whether missing important examples for a label can lead to a partial forgetting inside that label. So I decided to split my gold examples into silver annotations and add these to the annotations that I had created using the ner.teach recipe.
Additionally I programmatically added fake rejects to the annotations in order to have an equal amount of accepts and rejects per entity. The fake rejects were created based on the idea that a person cannot be an organization, an organization cannot be a location, a location cannot be an event, and so on. So for each accept for a span I added a reject with another label type, roughly as sketched below.
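In code, the fake rejects look roughly like this (the label mapping, names and example values are my own, not from Prodigy):
import copy

# arbitrary "cannot be" mapping: a PER span is rejected as ORG, an ORG span as LOC, etc.
SWAP = {"PER": "ORG", "ORG": "LOC", "LOC": "EVENT", "EVENT": "PER"}

def make_fake_reject(task):
    # copy an accepted single-span task and reject the same span under a different label
    reject = copy.deepcopy(task)
    reject["spans"][0]["label"] = SWAP[task["spans"][0]["label"]]
    reject["answer"] = "reject"
    return reject

accept = {"text": "Angela Merkel besuchte Berlin.",
          "spans": [{"start": 0, "end": 13, "label": "PER"}],
          "answer": "accept"}
print(make_fake_reject(accept))  # same span, now labelled ORG with answer "reject"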
After adding only a few more examples for the LOC and FAC labels, this already led to the interesting result that, with the Prodigy 1.8x ner.batch-train recipe, the overall accuracy climbed from 76.5 % to around 90 %.
So I ran another experiment:
First I trained the de_core_news_sm model with the gold annotations as described above.
Then I trained the resulting model again with the gold annotations converted to silver annotations (plus the additional fake rejects), without adding any additional training data for FAC and GEO.
With the Prodigy 1.8x ner.batch-train recipe the overall accuracy again climbed from 76.5 % to around 90 %.
With Prodigy 1.9x and the train ner recipe this improvement is even more drastic:
First go with:
train ner de_ner_gold de_core_news_sm --output model_ner_gold --eval-split 0.2 --n-iter 50 --batch-size 32
Results (as above):
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 86.443 85.432 85.935
PRODUCT 83.897 85.572 84.726
EVENT 79.348 66.364 72.277
FAC 58.696 50.943 54.545
LOC 92.209 88.634 90.386
PER 90.086 89.574 89.829
JOBTITLE 82.228 78.571 80.358
GEO 50.000 10.000 16.667
Best F-Score 86.177
Baseline 42.775
Second go with:
train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver --eval-split 0.2 --n-iter 20 --batch-size 32 --binary
with the output:
Correct 6924
Incorrect 406
Baseline 0.944
Accuracy 0.945
Alternate second go (with the --binary argument omitted):
train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver --eval-split 0.2 --n-iter 20 --batch-size 32
with the output:
Label Precision Recall F-Score
-------- --------- ------ -------
LOC 98.462 97.907 98.183
PER 98.207 97.624 97.915
JOBTITLE 96.451 95.483 95.965
ORG 96.378 96.456 96.417
PRODUCT 95.085 95.942 95.512
EVENT 95.276 91.667 93.436
FAC 91.228 81.250 85.950
GEO 100.000 50.000 66.667
Best F-Score 96.668
Baseline 96.765
Not sure how to interpret these results. Can training a model again with the same data really lead to such an improvement?
Final question for now:
When googling for „LOSS in NLP“ I found the following:
A loss function is going to serve as a measurement of how far our current set of predictions are from the corresponding true values. Some examples of loss functions that are commonly used in machine learning include: Mean-Squared-Error.
I’m only asking because @honnibal stated somewhere (I think it was on this forum) that this value should decrease during training, aiming for zero without ever reaching it (IIRC).
Here is what I get when training with Prodigy 1.9.5 and train ner. The loss values with Prodigy 1.8x were equally high. Is that expected?
# Loss Precision Recall F-Score
-- -------- --------- -------- --------
1 91049.41 75.148 64.773 69.576
2 84954.84 77.621 70.256 73.755
3 82770.17 79.641 74.460 76.964
4 81367.21 80.886 76.761 78.770
5 80145.44 81.918 78.253 80.044
…
45 71295.62 86.974 85.355 86.157
46 71881.20 86.804 85.312 86.052
47 71647.44 86.876 85.284 86.073
48 70608.79 86.934 85.341 86.130
49 71308.20 86.965 85.384 86.167
50 70625.16 86.956 85.412 86.177
Sorry for the lengthy post and the many questions.
Kind regards, kamiwa
Hi Kamiwa,
All of your commands look correct, and I'm glad the v1.9 changes are working well for you.
I'm suspicious about your final experiments though: the accuracy improvements don't look right to me, but I don't immediately see what's going wrong. Are you sure the evaluation isn't changing also? If you just look at the model outputs, does the improvement seem as obvious as the figures suggest? Running the A/B evaluation on a completely fresh set of text should be informative.
Regarding the loss figures, those are aggregate numbers rather than per-batch numbers, which is why they look high. I wouldn't worry about it --- the only interesting thing is whether they decrease.
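To give a rough sense of scale (assuming the reported value is the loss summed over all updates in an epoch): your training split has on the order of 12–13 thousand examples, so a loss of around 71,000 works out to roughly 71,000 / 12,800 ≈ 5.5 per example. The absolute number looks big simply because it's a sum over the whole epoch.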
@honnibal: Thank you for the reply and sorry for taking so long to respond. I'm currently busy with another project. I will have to retrain my models anyway, and will then do some more testing and post the results here.
Hi Honnibal,
I finally found time to do some more testing and I think I found an explanation for the apparent model improvements shown in my previous post.
Following the recommendations from your reply to my post „Is there something wrong in general with the German model?“ I am using gold standard annotations during the first batch training round.
During the second round, I am using silver standard annotations.
These silver standard annotations were created by splitting the original gold annotation sentences into one annotation per annotated span.
Example:
The gold standard annotation:
Accept: Donald Trump (PER) was inaugurated as the 45th president (JOBTITLE) of the United States (LOC), succeeding Barack Obama (PER).
is split into four silver standard annotations:
Accept: Donald Trump (PER) was inaugurated as the 45th president of the United States, succeeding Barack Obama.
Accept: Donald Trump was inaugurated as the 45th president (JOBTITLE) of the United States, succeeding Barack Obama.
Accept: Donald Trump was inaugurated as the 45th president of the United States (LOC), succeeding Barack Obama.
Accept: Donald Trump was inaugurated as the 45th president of the United States, succeeding Barack Obama (PER).
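In code, this split looks roughly like this (simplified, variable and function names are my own):
import copy

def split_gold_task(task):
    # turn one gold task with several spans into one single-span silver task per span
    silver_tasks = []
    for span in task["spans"]:
        silver = copy.deepcopy(task)
        silver["spans"] = [span]
        silver["answer"] = "accept"
        silver_tasks.append(silver)
    return silver_tasks

gold = {"text": "Angela Merkel besuchte Berlin.",
        "spans": [{"start": 0, "end": 13, "label": "PER"},
                  {"start": 23, "end": 29, "label": "LOC"}],
        "answer": "accept"}
for silver in split_gold_task(gold):
    print(silver["spans"])  # one span per silver task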
As this method would lead to a dataset that only contains „accept“ annotations, I then added four artificial „reject“ annotations:
Reject: Donald Trump (ORG) was inaugurated as the 45th president of the United States, succeeding Barack Obama.
Reject: Donald Trump was inaugurated as the 45th president (LOC) of the United States, succeeding Barack Obama.
Reject: Donald Trump was inaugurated as the 45th president of the United States (JOBTITLE), succeeding Barack Obama.
Reject: Donald Trump was inaugurated as the 45th president of the United States, succeeding Barack Obama (ORG).
So when re-training the model that resulted from the first batch training with these silver annotations, the batch-train process has the additional information from the reject annotations.
This additional information apparently leads to an improvement of the model.
As recommended in your previous reply, I have now compared the resulting two models using the ner.eval-ab recipe. From what I can see, the second model (B) is performing slightly better than the first (A). But the results are still far away from the accuracy and precision that the figures from the second batch training suggest.
So in order to find out what is going on, I did another experiment:
In the first round I again trained the shipped de_core_news_sm model with the gold standard annotations using the following command:
pgy train ner de_ner_gold de_core_news_sm --output de_ner_gold --eval-split 0.2 --n-iter 50 --batch-size 32
Here the results:
✔ Loaded model 'de_core_news_sm'
Created and merged data for 15641 total examples
Using 12513 train / 3128 eval (split 20%)
Component: ner | Batch size: 32 | Dropout: 0.2 | Iterations: 50
ℹ Baseline accuracy: 44.479
……
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 88.262 86.272 87.256
PRODUCT 83.364 87.087 85.185
LOC 90.835 89.615 90.221
JOBTITLE 83.400 81.332 82.353
PER 89.546 91.458 90.492
FAC 52.941 36.486 43.200
EVENT 81.443 68.103 74.178
GEO 50.000 15.385 23.529
Best F-Score 86.845
Baseline 44.479
I then used the resulting model and this time trained it again with the exact same dataset and the exact same command as in the first round:
pgy train ner de_ner_gold de_ner_gold --output de_ner_gold_round_2 --eval-split 0.2 --n-iter 50 --batch-size 32
Results:
✔ Loaded model 'de_ner_gold'
Created and merged data for 15641 total examples
Using 12513 train / 3128 eval (split 20%)
Component: ner | Batch size: 32 | Dropout: 0.2 | Iterations: 50
ℹ Baseline accuracy: 96.809
Interestingly, the recipe claims that the baseline accuracy is now 96.809, although none of the results from the first round (Precision, Recall and F-Score per entity, and Best F-Score) are anywhere near this value.
Again the results from this second round suggest a dramatic improvement:
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 97.120 96.845 96.982
PER 98.194 97.585 97.889
JOBTITLE 95.409 96.974 96.185
LOC 97.544 97.288 97.416
PRODUCT 95.200 95.796 95.497
EVENT 93.077 93.798 93.436
FAC 87.692 87.692 87.692
GEO 90.000 69.231 78.261
Best F-Score 96.735
Baseline 96.809
When comparing the resulting two models with the ner.eval-ab recipe, I see a slight improvement in model B.
But again, these improvements are definitely nowhere near a 96.735 F-Score.