Thank you so much @ines.
Let’s see whether I understood you correctly:
I’m trying to tweak NER in the German de_core_news_sm model to my needs. I’ve collected 16,000 gold annotated sentences using Prodigy 1.8x and the "old" ner.make-gold recipe with the following arguments:
ner.make-gold de_ner_gold de_core_news_sm sentences.jsonl
--label labels.txt --unsegmented
labels.txt contains the following labels: the already existing labels LOC (keeping countries, cities and states), ORG and PER, plus five new labels: JOBTITLE, PRODUCT, FAC (facilities: buildings, airports, highways, bridges, etc.), GEO (non-GPE locations, mountain ranges, bodies of water) and EVENT.
MISC was deliberately ignored as it is considered not useful.
In Prodigy 1.9x ner.make-gold is now deprecated. So my command for collecting the gold annotations would translate to:
ner.correct de_ner_gold de_core_news_sm sentences.jsonl
--label labels.txt --unsegmented
Correct?
The de_core_news_sm model was then trained using Prodigy 1.8x and the "old" ner.batch-train recipe with the following arguments:
ner.batch-train de_ner_gold de_core_news_sm --output model_ner_gold
--label labels.txt --eval-split 0.2 --n-iter 50 --batch-size 32
--unsegmented --no-missing
The --no-missing argument was used because @honnibal recommended it in his answer to my earlier post "Is there something wrong in general with the German model?", and this indeed improved accuracy quite a bit. The model trained with this argument achieved an overall accuracy of 76.5 %.
In Prodigy 1.9x ner.batch-train is now deprecated. So my command for training the de_core_news_sm model would translate to:
train ner de_ner_gold de_core_news_sm --output model_ner_gold
--eval-split 0.2 --n-iter 50 --batch-size 32
The arguments --label, --unsegmented and --no-missing are no longer supported:
--label is no longer needed, as the train ner recipe detects the labels from the annotations.
--unsegmented is not needed, as the train ner recipe uses the text from the annotations as is.
--ner-missing is not needed, as it is the contrary of --no-missing, and --no-missing is now the default behavior in NER training, as @ines explained above.
Correct?
When using the above statement, I get the following result from training:
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 86.443 85.432 85.935
PRODUCT 83.897 85.572 84.726
EVENT 79.348 66.364 72.277
FAC 58.696 50.943 54.545
LOC 92.209 88.634 90.386
PER 90.086 89.574 89.829
JOBTITLE 82.228 78.571 80.358
GEO 50.000 10.000 16.667
Best F-Score 86.177
Baseline 42.775
Altogether I’m very pleased with the much more informative and detailed per-label output of the results. Although these results don’t allow a direct comparison with the "old" overall accuracy value, I can see that the results from the new train ner recipe have greatly improved, as an overall F-score of 86.177 is clearly better than an overall accuracy of 76.5 %. Isn’t it?
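As a sanity check on the per-label table above, the F-score is just the harmonic mean of precision and recall, which can be verified for any row, e.g. GEO (a quick check of my own, values in percent):

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduces the GEO row from the table above.
print(round(f_score(50.0, 10.0), 3))  # 16.667
```

This also makes clear why GEO scores so low overall: a recall of only 10 % drags the harmonic mean far below the precision of 50 %.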
As the training data hasn’t changed, I can only assume that these improved results come either from the new German spaCy model supplied with spaCy v2.2, from the improved train ner recipe, or from a combination of the two.
Anyhow: these results also clearly indicate that the FAC and GEO labels in particular need further training, a fact I had already discovered by testing and evaluating the model after it was trained with Prodigy 1.8x and the ner.batch-train recipe.
So in order to improve the model I had collected further examples using Prodigy 1.8x and the ner.teach recipe.
Here is what I used to annotate:
ner.teach de_ner_silver model_ner_gold de_more_sentences.jsonl
--label labels.txt --unsegmented
labels.txt contained the same labels as above.
In Prodigy 1.9x this recipe is unchanged. Correct?
And here is what I used to train my model in Prodigy 1.8x:
ner.batch-train de_ner_silver model_ner_gold
--output model_ner_silver
--label labels.txt --eval-split 0.20 --n-iter 20 --batch-size 32
--unsegmented
In Prodigy 1.9x this would translate to:
train ner de_ner_silver model_ner_gold --output model_ner_silver
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary --ner-missing
Correct?
What I had noticed after training was that the model completely "forgot" the MISC label.
In my case this was actually what I wanted, because the MISC label wasn’t very useful for me anyway. Still, I wanted to know why that is, and googling led me to an article about the "catastrophic forgetting" problem: Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP · Explosion.
After reading @honnibal’s article I wasn’t sure whether this problem would only occur if a certain label isn’t contained in the training data at all, or whether missing important examples for a label can lead to partial forgetting within that label. So I decided to split my gold examples into silver annotations and add these to the annotations that I created using the ner.teach recipe.
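The gold-to-silver split can be sketched roughly like this (a minimal sketch of my own approach, assuming the usual Prodigy JSONL format with a "text", a list of "spans" and an "answer"; this is not an official recipe):

```python
def gold_to_silver(example):
    """Split one fully annotated gold example into one binary 'accept'
    example per span, as consumed by binary (ner.teach-style) training."""
    for span in example.get("spans", []):
        yield {"text": example["text"], "spans": [span], "answer": "accept"}

# Toy gold example with two annotated spans.
gold = {
    "text": "Siemens baute eine Fabrik in München.",
    "spans": [
        {"start": 0, "end": 7, "label": "ORG"},
        {"start": 29, "end": 36, "label": "LOC"},
    ],
}
silver = list(gold_to_silver(gold))
print(len(silver))  # 2
```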
Additionally, I programmatically added fake rejects to the annotations in order to have an equal number of accepts and rejects per entity. The fake rejects were created on the idea that a person cannot be an organization, an organization cannot be a location, a location cannot be an event, and so on. So for each accept for a span I added a reject with another label type.
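For reference, the fake-reject generation I used looks roughly like this (again my own sketch; the label cycle beyond the three pairs mentioned above is an arbitrary choice of mine):

```python
import copy

# Hypothetical mapping from each label to a deliberately wrong label
# (PER is never an ORG, ORG never a LOC, LOC never an EVENT, ...).
WRONG_LABEL = {
    "PER": "ORG", "ORG": "LOC", "LOC": "EVENT", "EVENT": "FAC",
    "FAC": "GEO", "GEO": "PRODUCT", "PRODUCT": "JOBTITLE", "JOBTITLE": "PER",
}

def make_fake_reject(example):
    """Return a copy of an accepted single-span example with the span's
    label swapped to a wrong one and the answer set to 'reject'."""
    fake = copy.deepcopy(example)
    fake["spans"][0]["label"] = WRONG_LABEL[fake["spans"][0]["label"]]
    fake["answer"] = "reject"
    return fake

accepted = {
    "text": "Angela Merkel besuchte Berlin.",
    "spans": [{"start": 0, "end": 13, "label": "PER"}],
    "answer": "accept",
}
rejected = make_fake_reject(accepted)
print(rejected["spans"][0]["label"], rejected["answer"])  # ORG reject
```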
After adding only a few more examples for the LOC and FAC labels, this already led to the interesting result that with the Prodigy 1.8x ner.batch-train recipe the overall accuracy climbed from 76.5 % to roughly 90 %.
So I ran another experiment:
First I trained the de_core_news_sm model with the gold annotations as described above.
Then I trained the resulting model again with the gold annotations converted to silver annotations (plus the additional fake rejects), without adding any additional training data for FAC and GEO.
With the Prodigy 1.8x ner.batch-train recipe the overall accuracy again climbed from 76.5 % to roughly 90 %.
With Prodigy 1.9x and the train ner recipe this improvement is even more drastic:
First go with:
train ner de_ner_gold de_core_news_sm --output model_ner_gold
--eval-split 0.2 --n-iter 50 --batch-size 32
Results (as above):
Label Precision Recall F-Score
-------- --------- ------ -------
ORG 86.443 85.432 85.935
PRODUCT 83.897 85.572 84.726
EVENT 79.348 66.364 72.277
FAC 58.696 50.943 54.545
LOC 92.209 88.634 90.386
PER 90.086 89.574 89.829
JOBTITLE 82.228 78.571 80.358
GEO 50.000 10.000 16.667
Best F-Score 86.177
Baseline 42.775
Second go with:
train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver
--eval-split 0.2 --n-iter 20 --batch-size 32 --binary
with the output:
Correct 6924
Incorrect 406
Baseline 0.944
Accuracy 0.945
Alternate second go (with the --binary argument omitted):
train ner de_ner_gold_as_silver model_ner_gold --output model_ner_silver
--eval-split 0.2 --n-iter 20 --batch-size 32
with the output:
Label Precision Recall F-Score
-------- --------- ------ -------
LOC 98.462 97.907 98.183
PER 98.207 97.624 97.915
JOBTITLE 96.451 95.483 95.965
ORG 96.378 96.456 96.417
PRODUCT 95.085 95.942 95.512
EVENT 95.276 91.667 93.436
FAC 91.228 81.250 85.950
GEO 100.000 50.000 66.667
Best F-Score 96.668
Baseline 96.765
Not sure how to interpret these results. Can training a model again with the same data really lead to such an improvement?
Final question for now:
When googling for „LOSS in NLP“ I found the following:
A loss function is going to serve as a measurement of how far our current set of predictions are from the corresponding true values. Some examples of loss functions that are commonly used in machine learning include: Mean-Squared-Error.
I’m only asking because @honnibal stated somewhere (I think it was on this forum) that this value should decrease during training, aiming for zero without ever reaching it (IIRC).
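To illustrate the quoted definition with a toy example of my own (note that spaCy’s NER does not use mean squared error, so the absolute scale of its reported loss is not comparable to this):

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Predictions closer to the truth yield a smaller loss; the loss shrinks
# toward zero but only reaches zero for a perfect fit.
print(mean_squared_error([1.0, 2.0, 3.0], [1.2, 2.0, 2.6]))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```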
Here is what I get when training with Prodigy 1.9.5 and train ner. The loss values with Prodigy 1.8x were equally high. Is that expected?
# Loss Precision Recall F-Score
-- -------- --------- -------- --------
1 91049.41 75.148 64.773 69.576
2 84954.84 77.621 70.256 73.755
3 82770.17 79.641 74.460 76.964
4 81367.21 80.886 76.761 78.770
5 80145.44 81.918 78.253 80.044
45 71295.62 86.974 85.355 86.157
46 71881.20 86.804 85.312 86.052
47 71647.44 86.876 85.284 86.073
48 70608.79 86.934 85.341 86.130
49 71308.20 86.965 85.384 86.167
50 70625.16 86.956 85.412 86.177
Sorry for the lengthy post and the many questions.
Kind regards, kamiwa