I'm looking for tips on how to effectively handle the following scenario, as I'm kind of groping in the dark.
I want to train an NER model on a totally novel set of entity classes. Let's call them types A, B and C. (Sorry for not being more specific, I am under NDA.) The challenge is that the type distribution over the data is unbalanced. I have an annotated dataset of about 7000 entities over about 5000 lines of text. Roughly 60% of these entities are type A, 35% are type B, but type C entities represent only 5%.
Pilot experiments (prodigy train starting from en_core_web_trf with all default settings) show encouraging results for types A and B, but pretty bad ones for C, as might be expected. It does not help that C is more ambiguous to annotate than A or B. C cannot be easily covered by pattern rules.
I have the means to annotate more data, but what are some good strategies to apply here?
Should I artificially balance the dataset to inflate the number of type C entities? (Then I would have to be careful not to introduce more type A or B, as types co-occur in sentences, and that would make the dataset look quite skewed.)
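To make that trade-off concrete, here is a minimal sketch of what I mean by inflating type C: duplicate lines that contain a C entity until C reaches a target share, while tracking how many A/B entities get dragged along. The `examples` structure (dicts with a `"labels"` list per line) is hypothetical, just for illustration:

```python
import random
from collections import Counter

def oversample_minority(examples, minority_label="C", target_share=0.15, seed=0):
    """Duplicate lines containing the minority label until that label's
    share of all entities reaches target_share. Note that A/B entities
    co-occurring on the same line get duplicated too, so inspect the
    returned counts for the skew this introduces."""
    rng = random.Random(seed)
    counts = Counter(lab for ex in examples for lab in ex["labels"])
    minority_pool = [ex for ex in examples if minority_label in ex["labels"]]
    out = list(examples)
    while counts[minority_label] / sum(counts.values()) < target_share:
        ex = rng.choice(minority_pool)
        out.append(ex)
        counts.update(ex["labels"])
    return out, counts
```

Checking the returned `counts` after oversampling shows exactly how much extra A and B the duplicated lines brought in.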
I believe active learning can help me, but I don't fully understand the Prodigy implementation. I was thinking of using ner.teach to annotate specifically type C labels on new data, but then is it fruitful to retrain a model trained on A, B and C on this new C-only data? Is there a risk of losing accuracy on the A and B labels in that scenario?
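For reference, the kind of session I have in mind would look something like this (dataset and path names are placeholders, not my real setup):

```shell
# Collect new type-C annotations, letting the pilot model's uncertainty
# pick the examples to show (names are placeholders)
prodigy ner.teach ner_c_teach ./pilot-model/model-best ./new_data.jsonl --label C

# Then retrain on the original dataset plus the new C annotations together,
# rather than on the C data alone, so A and B examples stay in the mix
prodigy train ./output --ner ner_full,ner_c_teach --base-model en_core_web_trf
```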
Hi @SandstoneGolem ,
Sorry for the delay getting to this. This type of problem is difficult, especially not knowing the specifics.
I think creating a biased dataset is often worth it here, because if the accuracy on the first two types starts to slip, you can monitor that easily. Another thing to think about is whether types A and B are hurting the accuracy of type C. If you can train a model only on type C and get better accuracy, that can be a useful "lead" for you. You can have two NER models in your spaCy pipeline, or use the insight that there's an interaction between the accuracies to guide your data collection.
Finally, if type C is really more difficult and ambiguous to annotate, you could look at whether there are issues defining the boundaries. If so, sometimes you can get away with reframing the annotation task so that you only annotate a single trigger word, or so that you only annotate the whole text, and the task is "is this entity expressed in the text, anywhere". This might or might not be possible on your problem, though.
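As a sketch of that last reframing, span annotations can be collapsed into a document-level yes/no question, which sidesteps the boundary problem entirely. The example record format (`"text"` plus a `"spans"` list) is assumed, not anything specific to your data:

```python
def to_doc_level(examples, label="C"):
    """Reframe span annotation as a document-level question:
    'is an entity of this type expressed anywhere in the text?'
    Boundary disagreements disappear because no boundaries are kept."""
    return [
        {
            "text": ex["text"],
            "label": "HAS_C" if any(s["label"] == label for s in ex.get("spans", [])) else "NO_C",
        }
        for ex in examples
    ]
```

The resulting data could then feed a text classifier instead of an NER component.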
Thank you for the answer. You raise a good point about monitoring performance on A and B.
I think I've settled on an experimental route to use separate models for [A+B] and C, and aggregate the results in some way towards the final output. It will definitely take some data engineering to see if we can make C less ambiguous or make the task simpler in some way.
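For the aggregation step, one simple policy I'm considering is: keep all spans from the specialist C model, and add spans from the [A+B] model only where they don't overlap. This is just a sketch with a hypothetical span format (character offsets plus a label), not settled design:

```python
def merge_spans(ab_spans, c_spans):
    """Combine predictions from the [A+B] model and the C model.
    On character overlap, the C model's span wins, since it is the
    specialist for the rarer, harder type."""
    def overlaps(s, t):
        return s["start"] < t["end"] and t["start"] < s["end"]

    merged = list(c_spans)
    for s in ab_spans:
        if not any(overlaps(s, c) for c in c_spans):
            merged.append(s)
    return sorted(merged, key=lambda s: s["start"])
```

Whether C should always win on overlap is exactly the kind of thing we'll have to check against held-out A/B accuracy.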
Cheers and thanks for all the work on this amazing tool.