Best practices for NER annotation to avoid overfitting

This is my usual practice (a rough sketch of the loop in Prodigy commands is below) -

1.) annotate a bunch of data points -> they get saved in a SQL DB in a dataset called ner_gold
2.) train modelv1 (on ner_gold)
3.) look for mistakes in predictions on new data points -> correct the annotations -> they get saved in the SQL DB in ner_gold (this way my ner_gold keeps getting bigger)
4.) train modelv2 (again on ner_gold)

Repeat.
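Roughly, in Prodigy CLI terms the loop looks like this (this assumes the v1.11+ recipe names; the labels, file paths and output paths are just placeholders):

```
# 1.) annotate new examples into the ner_gold dataset
prodigy ner.manual ner_gold blank:en ./new_data.jsonl --label PERSON,ORG

# 2.) / 4.) train on everything currently in ner_gold
prodigy train ./modelv1 --ner ner_gold

# 3.) review the model's predictions on fresh data, correct them,
#     and save the corrections back into ner_gold
prodigy ner.correct ner_gold ./modelv1/model-best ./more_data.jsonl --label PERSON,ORG
```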

My question is: am I supposed to save my dataset under new names (ner_gold_v1, ner_gold_v2, etc.) if I'm retraining the same model for improvement, to avoid overfitting? Or does Prodigy take care of this somehow?

Hi! In general, your workflow sounds good. Just to confirm: you're always creating complete, gold-standard annotations, using new data and not annotating the same examples twice, right?

Prodigy can take care of merging annotations on the same text, so it's no problem to annotate labels separately. You just want to make sure you don't end up with multiple, potentially conflicting versions of the same example in the dataset you're training from. And you typically also want to train a model from scratch, using the same full corpus (instead of updating the same artifact over and over again, which often makes it much harder to avoid overfitting and forgetting effects).
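If you want to double-check that, here's a quick sanity-check sketch using Prodigy's database API (assuming the documented connect() / get_dataset() helpers; ner_gold is your dataset name):

```python
from prodigy.components.db import connect

db = connect()  # connects to the database configured for Prodigy
examples = db.get_dataset("ner_gold")

# _input_hash identifies the input text, _task_hash the text plus its
# annotations. The same _input_hash appearing with different _task_hashes
# means the same text was annotated more than once - fine if you annotated
# labels separately, but worth reviewing for conflicting versions.
versions = {}
for eg in examples:
    versions.setdefault(eg["_input_hash"], set()).add(eg["_task_hash"])

duplicates = {h: t for h, t in versions.items() if len(t) > 1}
print(f"{len(duplicates)} texts with more than one annotated version")
```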

Are you using modelv1 here as the base model, or are you starting off with a blank model (or basically, the same base model you used when you trained the first version)? I'd definitely recommend starting fresh every time you train on the whole ner_gold data.
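To make the difference concrete, with the v1.11+ train command that's roughly (output paths are placeholders):

```
# train from scratch on the full ner_gold dataset (recommended)
prodigy train ./model_out --ner ner_gold

# vs. keep updating the previous artifact via --base-model
prodigy train ./model_out --ner ner_gold --base-model ./modelv1/model-best
```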

Yeah, I am using the new data points (and no, I don't have any duplicates in my gold dataset) to continuously improve the model from my previous iteration, i.e. I'm not training from scratch each time.

If I preferred not to train from scratch, then I guess I should maintain separate datasets for each iteration? Is there a preference one way or the other (train from scratch on the full gold dataset, or iteratively improve the artifact with a few hundred additional data points in each iteration)?

If you really wanted to do that, then yes, you'd probably want separate datasets and to make sure that each set you're using for incremental updates is more or less representative. For example, you wouldn't want one set to only include label A and the next one to only include label B, or something like that.
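One quick way to eyeball that is to compare label counts per dataset, e.g. with the database API (ner_gold_v1 / ner_gold_v2 are just the hypothetical per-iteration names from above):

```python
from collections import Counter
from prodigy.components.db import connect

db = connect()
for name in ["ner_gold_v1", "ner_gold_v2"]:
    # count entity labels across all annotated spans in the dataset
    counts = Counter(
        span["label"]
        for eg in db.get_dataset(name)
        for span in eg.get("spans", [])
    )
    print(name, dict(counts))
```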

But if possible, I'd definitely recommend training from scratch on the full set. This also makes it much easier to reason about the results.