Best practices for NER annotation to avoid overfitting

This is my usual practice -

1.) annotate a bunch of data points -> saved in a SQL DB in a dataset named ner_gold
2.) train modelv1 (on ner_gold)
3.) look for mistakes in predictions on new data points -> correct the annotations -> saved back into ner_gold in the SQL DB (this way my ner_gold keeps growing)
4.) train modelv2 (again on ner_gold)
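The "corrections go back into the same growing dataset" part of the loop above can be sketched with a tiny SQLite store, where a corrected annotation replaces the old row for the same text instead of adding a duplicate. This is a toy sketch only: the table and column names are made up and have nothing to do with Prodigy's actual database schema.

```python
import json
import sqlite3

# Toy stand-in for the annotation store (NOT Prodigy's real schema).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ner_gold (
        text TEXT PRIMARY KEY,  -- one row per unique input text
        annotation TEXT         -- the gold spans, stored as JSON
    )
""")

def save_annotation(text, spans):
    """Insert a new example, or overwrite the previous annotation for
    the same text so corrections don't create conflicting duplicates."""
    con.execute(
        "INSERT INTO ner_gold (text, annotation) VALUES (?, ?) "
        "ON CONFLICT(text) DO UPDATE SET annotation = excluded.annotation",
        (text, json.dumps({"text": text, "spans": spans})),
    )

# Step 1 of the workflow: initial annotations.
save_annotation("Apple hired Tim", [{"start": 0, "end": 5, "label": "ORG"}])
save_annotation("Berlin is big", [{"start": 0, "end": 6, "label": "GPE"}])

# Step 3 of the workflow: a correction for an already-annotated example.
save_annotation("Apple hired Tim", [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 12, "end": 15, "label": "PERSON"},
])

count = con.execute("SELECT COUNT(*) FROM ner_gold").fetchone()[0]
print(count)  # still 2 examples: the correction replaced the old row
```

The upsert is the key design choice here: the dataset only ever grows by genuinely new examples, and re-annotating a text updates it in place rather than leaving two conflicting versions behind.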


My question is: am I supposed to save my dataset under new names (ner_gold_v1, ner_gold_v2, etc.) if I'm retraining the same model for improvement, to avoid overfitting? Or does Prodigy take care of this somehow?

Hi! In general, your workflow sounds good. Just to confirm: you're always creating complete, gold-standard annotations, using new data, and not annotating the same examples twice, right?

Prodigy can take care of merging annotations on the same text, so it's no problem to annotate labels separately. You just want to make sure you don't end up with multiple, potentially conflicting versions of the same example in the dataset you're training from. And you typically also want to train a model from scratch, using the same full corpus (instead of updating the same artifact over and over again, which often makes it much harder to avoid overfitting and forgetting effects).
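To check that a training set doesn't contain conflicting copies of the same text, you could export it (e.g. with `prodigy db-out ner_gold > ner_gold.jsonl`) and scan for texts that appear more than once with different span annotations. A rough sketch, where the record layout is a simplified assumption rather than the exact export format:

```python
import json
from collections import defaultdict

def find_conflicts(lines):
    """Group JSONL annotation records by text and report texts that
    appear multiple times with differing span annotations."""
    by_text = defaultdict(set)
    for line in lines:
        eg = json.loads(line)
        # Normalise spans so ordering differences don't count as conflicts.
        spans = tuple(sorted(
            (s["start"], s["end"], s["label"]) for s in eg.get("spans", [])
        ))
        by_text[eg["text"]].add(spans)
    return [text for text, versions in by_text.items() if len(versions) > 1]

# Example: the same sentence annotated two different ways.
records = [
    {"text": "Apple hired Tim", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim", "spans": []},
    {"text": "Berlin is big", "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]
print(find_conflicts(json.dumps(r) for r in records))  # ['Apple hired Tim']
```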

Are you using modelv1 here as the base model, or are you starting off with a blank model (or basically, the same setup you used when you trained the first version of your model)? I'd definitely recommend starting fresh every time you train on the whole ner_gold data.

Yeah, I am using the new data points to continuously improve the model from my previous iteration, i.e. I'm not training from scratch each time. (And no, I don't have any duplicates in my gold dataset.)

If I were to prefer not training from scratch, then I guess I should maintain separate datasets for each iteration? Is there a preference one way or the other? (Train from scratch on the gold dataset, or iteratively improve the artifact with a few hundred additional data points in each iteration?)

If you really wanted to do that, then yes, you'd probably want separate datasets, and you'd want to make sure that each set you use for incremental updates is more or less representative. For example, you wouldn't want one set to only include label A and the next one to only include label B.
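One quick way to sanity-check that each incremental batch is representative is to compare its label distribution against the full corpus before using it for updates. A minimal sketch with made-up data:

```python
from collections import Counter

def label_distribution(examples):
    """Count how often each entity label occurs in a set of annotations."""
    counts = Counter()
    for eg in examples:
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts

batch_1 = [
    {"text": "...", "spans": [{"label": "ORG"}, {"label": "PERSON"}]},
    {"text": "...", "spans": [{"label": "ORG"}]},
]
batch_2 = [
    {"text": "...", "spans": [{"label": "GPE"}]},
]

print(label_distribution(batch_1))  # Counter({'ORG': 2, 'PERSON': 1})
print(label_distribution(batch_2))  # Counter({'GPE': 1}): no ORG or PERSON,
# so updating on this batch alone risks the forgetting effects mentioned above
```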

But if possible, I'd definitely recommend training from scratch on the full set. This also makes it much easier to reason about the results.