I'm looking for tips on how to effectively handle the following scenario, as I'm kind of groping in the dark.
I want to train NER on a totally novel set of entity classes. Let's call them types A, B and C. (Sorry for not being more specific, I am under NDA.) The challenge is that the type distribution over the data is unbalanced. I have an annotated dataset of about 7000 entities over about 5000 lines of text. Roughly 60% of these entities are type A, 35% are type B, but type C entities represent only 5%.
Pilot experiments (prodigy train starting from en_core_web_trf with all default settings) show encouraging results for types A and B, but pretty bad ones for type C, as might be expected. It does not help that C is more ambiguous to annotate than A or B. C also cannot be easily covered by pattern rules.
I have the means to annotate more data, but what are some good strategies to apply here?
Should I artificially rebalance the dataset by oversampling examples that contain type C entities? (I would then have to be careful not to inflate types A and B as well: the types co-occur in sentences, so repeating type C sentences could skew the dataset in a different way.)
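To make the balancing idea concrete, here is a minimal sketch of what I have in mind. It assumes the annotations are Prodigy-style dicts with a "spans" list in which every span carries a "label". The function names and the toy examples are hypothetical; they are not my real data and not part of any library.

```python
from collections import Counter


def label_counts(examples):
    """Count entity spans per label across all examples."""
    return Counter(
        span["label"] for eg in examples for span in eg.get("spans", [])
    )


def oversample(examples, rare_label, factor):
    """Repeat each example containing `rare_label` so it occurs `factor` times.

    Examples without the rare label are kept exactly once. The repeats are
    references to the same dict, which is fine for counting or for writing
    the data back out as JSONL.
    """
    out = []
    for eg in examples:
        has_rare = any(
            span["label"] == rare_label for span in eg.get("spans", [])
        )
        out.extend([eg] * (factor if has_rare else 1))
    return out


# Toy stand-ins for annotated lines (placeholder text, made-up offsets).
examples = [
    {"text": "a1 b1", "spans": [{"start": 0, "end": 2, "label": "A"},
                                {"start": 3, "end": 5, "label": "B"}]},
    {"text": "a2", "spans": [{"start": 0, "end": 2, "label": "A"}]},
    {"text": "a3 c1", "spans": [{"start": 0, "end": 2, "label": "A"},
                                {"start": 3, "end": 5, "label": "C"}]},
    {"text": "b2", "spans": [{"start": 0, "end": 2, "label": "B"}]},
]

before = label_counts(examples)
after = label_counts(oversample(examples, rare_label="C", factor=3))
print("before:", dict(before))
print("after: ", dict(after))
```

On the toy data, tripling the one example that contains a C span also raises the A count from 3 to 5, which is exactly the side effect I am worried about.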
I believe active learning can help me, but I don't fully understand the Prodigy implementation. I was thinking of using ner.teach to annotate only type C labels on new data, but does it then make sense to update a model already trained on A, B and C with this new C-only data? Is there a risk of losing accuracy on the A and B labels in that scenario?