handle imbalance in named entity recognition

sampathkumaran · August 15, 2021, 1:24pm

How can I handle data imbalance in named entity recognition in spaCy? For example: label1 has 5000 annotations, label2 has 500 annotations and label3 has 50 annotations.

ines · August 17, 2021, 1:23am

Hi! What exactly is it that you're trying to solve here? Do you see worse predictions for labels that have fewer examples?

In general, entity labels will often be somewhat imbalanced, that's kind of inherent to how language works. It's fine if your data reflects that, and you don't necessarily need the exact same number of annotations for all labels. However, you should make sure that you have enough examples in different contexts for the model to learn from. So if your label with only 50 annotations is performing worse, you might want to collect more examples for that label.
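To see how the annotations are actually distributed before deciding which label needs more data, something like the following could work. It is only a rough sketch in plain Python: it assumes each example is a dict with a `"spans"` list whose entries carry a `"label"` key, and the `count_labels` helper and the sample data are made up for illustration.

```python
from collections import Counter


def count_labels(examples):
    """Count how many annotated spans each entity label has."""
    counts = Counter()
    for eg in examples:
        # Examples without a "spans" key are treated as having no entities.
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Made-up sample data in the assumed format.
examples = [
    {
        "text": "Apple hired Tim Cook.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 12, "end": 20, "label": "PERSON"},
        ],
    },
    {
        "text": "Google is based in California.",
        "spans": [{"start": 0, "end": 6, "label": "ORG"}],
    },
    {"text": "Nothing to annotate here."},
]

counts = count_labels(examples)
for label, n in counts.most_common():
    print(label, n)
```

Labels that show up near the bottom of this list and also score worse in evaluation are the ones worth collecting more examples for.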

You could use keyword or pattern matches to pre-select examples that are more likely to contain that label, so you don't have to go through too many irrelevant examples. It can also be helpful to look at the instances of that label that the model is getting wrong, and see if you can make adjustments to your label scheme. If your rare label is also slightly ambiguous, or difficult to annotate consistently, you may see worse results that are amplified by the fact that you only have very few examples. So in that case, restructuring your label scheme or correcting inconsistent annotations can also help.
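As a rough illustration of the keyword pre-selection idea, here is a minimal sketch using only Python's `re` module rather than any particular annotation tool. The `preselect` helper, the keyword list and the sample texts are all hypothetical, and the matching is a simple case-insensitive whole-word search.

```python
import re


def preselect(texts, keywords):
    """Return the texts that contain at least one keyword as a whole word."""
    # Escape each keyword so it is matched literally, then join into one
    # case-insensitive alternation. \b works best for keywords that start
    # and end with a letter or digit.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [text for text in texts if pattern.search(text)]


# Hypothetical keywords that hint at the rare label.
keywords = ["ibuprofen", "aspirin"]

texts = [
    "The patient was given Ibuprofen twice a day.",
    "She works at a hospital in Berlin.",
    "Aspirin can interact with other medication.",
]

candidates = preselect(texts, keywords)
for text in candidates:
    print(text)
```

The filtered texts still need to be annotated by hand. The filter only raises the share of examples that are likely to contain the rare label, and since it biases the sample towards the chosen keywords, it helps to keep the keyword list reasonably varied.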
