I'm looking for tips on how to effectively handle the following scenario, as I'm kind of groping in the dark.
I want to train an NER model on a totally novel set of entity classes. Let's call them types A, B and C. (Sorry for not being more specific, I am under NDA.) The challenge is that the type distribution over the data is unbalanced. I have an annotated dataset of about 7000 entities over about 5000 lines of text. Roughly 60% of these entities are type A, 35% are type B, but type C entities represent only 5%.
Pilot experiments (prodigy train starting from en_core_web_trf with all default settings) show encouraging results for types A and B, but pretty bad ones for C, as might be expected. It does not help that C is more ambiguous to annotate than A or B. C cannot be easily covered by pattern rules.
I have the means to annotate more data, but what are some good strategies to apply here?
Should I artificially balance the dataset to inflate the number of type C entities? (Then I would have to be careful not to introduce more type A or B, as types co-occur in sentences, and that would make the dataset look quite skewed.)
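To make that trade-off concrete, here is a minimal sketch of what I mean by inflating type C: duplicate lines that contain a C entity until C reaches a target share, while tracking how many A/B entities get dragged along. The `examples` structure (dicts with a `"labels"` list per line) is hypothetical, just for illustration:

```python
import random
from collections import Counter

def oversample_minority(examples, minority_label="C", target_share=0.15, seed=0):
    """Duplicate lines containing the minority label until that label's
    share of all entities reaches target_share. Note that A/B entities
    co-occurring on the same line get duplicated too, so inspect the
    returned counts for the skew this introduces."""
    rng = random.Random(seed)
    counts = Counter(lab for ex in examples for lab in ex["labels"])
    minority_pool = [ex for ex in examples if minority_label in ex["labels"]]
    out = list(examples)
    while counts[minority_label] / sum(counts.values()) < target_share:
        ex = rng.choice(minority_pool)
        out.append(ex)
        counts.update(ex["labels"])
    return out, counts
```

Checking the returned `counts` after oversampling shows exactly how much extra A and B the duplicated lines brought in.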
I believe active learning can help me, but I don't fully understand the Prodigy implementation. I was thinking of using ner.teach to annotate specifically type C labels on new data, but then is it fruitful to retrain a model trained on A, B and C on this new C-only data? Is there a risk of losing accuracy on the A and B labels in that scenario?
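For reference, the kind of session I have in mind would look something like this (dataset and path names are placeholders, not my real setup):

```shell
# Collect new type-C annotations, letting the pilot model's uncertainty
# pick the examples to show (names are placeholders)
prodigy ner.teach ner_c_teach ./pilot-model/model-best ./new_data.jsonl --label C

# Then retrain on the original dataset plus the new C annotations together,
# rather than on the C data alone, so A and B examples stay in the mix
prodigy train ./output --ner ner_full,ner_c_teach --base-model en_core_web_trf
```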
Hi @SandstoneGolem ,
Sorry for the delay getting to this. This type of problem is difficult, especially not knowing the specifics.
I think creating a biased dataset is often worth it here, because if the accuracy on the first two types starts to slip, you can monitor that easily. Another thing to think about is whether types A and B are hurting the accuracy of type C. If you can train a model only on type C and get better accuracy, that can be a useful "lead" for you. You can have two NER models in your spaCy pipeline, or use the insight that there's an interaction between the accuracies to guide your data collection.
Finally, if type C is really more difficult and ambiguous to annotate, you could look at whether there are issues defining the boundaries. If so, sometimes you can get away with reframing the annotation task so that you only annotate a single trigger word, or so that you only annotate the whole text, and the task is "is this entity expressed in the text, anywhere". This might or might not be possible on your problem, though.
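As a sketch of that last reframing, span annotations can be collapsed into a document-level yes/no question, which sidesteps the boundary problem entirely. The example record format (`"text"` plus a `"spans"` list) is assumed, not anything specific to your data:

```python
def to_doc_level(examples, label="C"):
    """Reframe span annotation as a document-level question:
    'is an entity of this type expressed anywhere in the text?'
    Boundary disagreements disappear because no boundaries are kept."""
    return [
        {
            "text": ex["text"],
            "label": "HAS_C" if any(s["label"] == label for s in ex.get("spans", [])) else "NO_C",
        }
        for ex in examples
    ]
```

The resulting data could then feed a text classifier instead of an NER component.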
Thank you for the answer. You raise a good point about monitoring performance on A and B.
I think I've settled on an experimental route to use separate models for [A+B] and C, and aggregate the results in some way towards the final output. It will definitely take some data engineering to see if we can make C less ambiguous or make the task simpler in some way.
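For the aggregation step, one simple policy I'm considering is: keep all spans from the specialist C model, and add spans from the [A+B] model only where they don't overlap. This is just a sketch with a hypothetical span format (character offsets plus a label), not settled design:

```python
def merge_spans(ab_spans, c_spans):
    """Combine predictions from the [A+B] model and the C model.
    On character overlap, the C model's span wins, since it is the
    specialist for the rarer, harder type."""
    def overlaps(s, t):
        return s["start"] < t["end"] and t["start"] < s["end"]

    merged = list(c_spans)
    for s in ab_spans:
        if not any(overlaps(s, c) for c in c_spans):
            merged.append(s)
    return sorted(merged, key=lambda s: s["start"])
```

Whether C should always win on overlap is exactly the kind of thing we'll have to check against held-out A/B accuracy.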
Cheers and thanks for all the work on this amazing tool.