Hello,
I have built an NER model with many custom labels, and it works really well. It was trained on a corpus of 300k sentences.
The problem now is that I have to add another label, but I only have ~7k sentences annotated with it, so the data is very unbalanced.
I see two options:
Reduce the annotated sentences from 300k to 7k to get a balanced distribution of the labels. In this case, though, accuracy will drop for the labels that were covered by the original 300k sentences.
Train the model with the original labels on the 300k sentences (so good accuracy, as I wrote above) and then update the model with the new label on only 7k sentences (which are basically the first 7k of the 300k sentences I mentioned). In this case, though, I think the new label will carry too little weight…
The first thing to try is obviously the simplest, so: what happens if you just add the new annotations and retrain? You can try a few things like upsampling the sentences with the rare classes, but collecting more annotations for them is likely to be a more effective approach. Using pattern rules can also help, if you’re finding that phrases that match these entities exactly still aren’t being recognised by the model.
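If you want to try the upsampling idea, here is a minimal sketch of what it could look like. It assumes your training data is a list of `(text, spans)` tuples where each span is `(start, end, label)`. That format, the `DRUG` label, the `upsample_rare` helper and the `factor` value are all made up for illustration, so adapt them to whatever your training script actually expects.

```python
import random

def upsample_rare(examples, rare_label, factor=3, seed=0):
    """Return a shuffled copy of `examples` in which every sentence that
    contains `rare_label` appears `factor` times in total.

    `examples` is assumed to be a list of (text, spans) tuples, where
    spans is a list of (start, end, label). This format is an assumption.
    """
    out = []
    for text, spans in examples:
        has_rare = any(label == rare_label for _start, _end, label in spans)
        out.extend([(text, spans)] * (factor if has_rare else 1))
    # Shuffle so the repeated sentences are not adjacent during training.
    random.Random(seed).shuffle(out)
    return out

# Toy data; "DRUG" stands in for the newly added, rare label.
examples = [
    ("Acme hired Ann Lee.", [(0, 4, "ORG"), (11, 18, "PERSON")]),
    ("Take 5 mg of Zyrtec.", [(13, 19, "DRUG")]),
    ("Bob works in Rome.", [(0, 3, "PERSON"), (13, 17, "GPE")]),
]
train = upsample_rare(examples, rare_label="DRUG", factor=3)
```

Repeating sentences doesn't give the model any new information, so treat `factor` as something to tune against a held-out set rather than a fix on its own.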
Finally, I’ve only just thought of this so maybe it’s not effective, but you could try training a text classifier to predict which sentences contain at least one example of the entity you’re trying to annotate more of. You could then use this text classifier to select sentences that will be better targets for annotation.
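To make the classifier idea a bit more concrete, here is a rough sketch using only the standard library. It is a tiny bag-of-words Naive Bayes model, which is just a stand-in I picked for illustration; any text classifier you already have would do. The class name, the toy sentences and the `DRUG`-style examples are hypothetical. The sketch assumes you have at least one sentence with the entity and one without.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude lowercase word/number tokenizer, good enough for a sketch.
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesSentenceFilter:
    """Binary bag-of-words Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.class_counts = Counter()
        for text, label in zip(texts, labels):
            label = bool(label)
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        if not (self.class_counts[True] and self.class_counts[False]):
            raise ValueError("need at least one example of each class")
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_joint(self, tokens, label):
        total = sum(self.word_counts[label].values()) + len(self.vocab)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for tok in tokens:
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log((self.word_counts[label][tok] + 1) / total)
        return score

    def score(self, text):
        """Log-odds that the sentence contains the entity (higher = more likely)."""
        tokens = tokenize(text)
        return self._log_joint(tokens, True) - self._log_joint(tokens, False)

# True = sentence contains at least one span of the rare label.
labelled = [
    ("Take 5 mg of Zyrtec daily.", True),
    ("She was prescribed Zyrtec.", True),
    ("Bob works in Rome.", False),
    ("Acme hired Ann Lee.", False),
]
clf = NaiveBayesSentenceFilter().fit(*zip(*labelled))

# Rank unannotated sentences so the most promising ones get annotated first.
pool = ["Ann moved to Rome.", "He was prescribed 10 mg daily."]
ranked = sorted(pool, key=clf.score, reverse=True)
```

You would get the `True`/`False` labels for free from the sentences you have already annotated, then send the top of `ranked` to annotation. Keep in mind that only annotating high-scoring sentences can bias the sample, so it may be worth mixing in some random ones.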