Unevenly spread labels - does it affect the suggestions made?

Arul · November 8, 2018, 7:08pm

In my training data, the label distribution is very skewed. One label accounts for about 50% of the annotations, while the rarest accounts for only 0.02%, which is hardly anything. I am going to use ner.teach to improve the current model. Do the suggestions take the label distribution into account? For example, does it suggest candidates for the 0.02% label first (combined with the model's uncertainty)?
I would like to find a way to even out the distribution over the course of annotating with ner.teach. In the corpus and domain themselves the labels are not evenly spread either, though not as skewed as this.

honnibal · November 8, 2018, 7:46pm

If you can find a way to not need the very skewed label distribution, that will likely make your problem a lot easier to annotate and learn. For instance, is there a more common category, where you can use that label and a terminology list to identify your rare category?

If you must have the entity recogniser work on the rare category, you’ll probably be best off creating a custom recipe, with logic that uses some sort of information-retrieval approach to give you a reasonable number of candidates to annotate for that class.
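As a rough illustration of the candidate-retrieval idea above, here is a minimal sketch in plain Python that filters a stream of texts down to those mentioning a term from a terminology list and pre-marks the matches as spans. The names `RARE_LABEL`, `TERMS`, and `rare_candidates` are hypothetical, the retrieval step is just a case-insensitive regex match, and the output dicts assume the usual `text` / `spans` task layout. A real custom recipe would wrap something like this around its own loader and could use a stronger retrieval method.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Hypothetical placeholders: substitute your own rare label and term list.
RARE_LABEL = "RARE_TYPE"
TERMS = ["acme widget", "foo protocol"]


def build_pattern(terms: List[str]) -> "re.Pattern[str]":
    # Put longer terms first so they win over shorter overlapping ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def rare_candidates(
    texts: Iterable[str], terms: List[str], label: str
) -> Iterator[Dict]:
    """Yield only the texts that mention a term, with matches marked as spans."""
    pattern = build_pattern(terms)
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in pattern.finditer(text)
        ]
        if spans:  # skip texts with no candidate for the rare class
            yield {"text": text, "spans": spans}


# Toy input standing in for the real corpus.
texts = [
    "We shipped the Acme Widget today.",
    "Nothing relevant here.",
    "The foo protocol is old.",
]
tasks = list(rare_candidates(texts, TERMS, RARE_LABEL))
```

Filtering this way trades recall for a queue that is dense in the rare class, so it is worth spot-checking some of the texts that were filtered out.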

In theory, uncertainty sampling would prefer entities of the rare class, upsampling it in the annotation queue. In practice, however, we don’t want to assume the probabilities produced by the model are too well calibrated. After all, the model’s accuracy might not be high during training. We therefore smooth the model’s probabilities, so there’s not really much difference in how we handle a score of 0.01 and a score of 0.001.
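To make the effect of smoothing concrete, here is a toy sketch of uncertainty sampling over smoothed scores. It is not the actual implementation: the `smooth` function (linear shrinkage toward 0.5), its `strength` parameter, and the `uncertainty` formula are all assumptions chosen for illustration only.

```python
from typing import List, Tuple


def smooth(p: float, strength: float = 0.1) -> float:
    # Assumed smoothing: pull the probability part of the way toward 0.5.
    return (1.0 - strength) * p + strength * 0.5


def uncertainty(p: float) -> float:
    # 1.0 when p is exactly 0.5, falling to 0.0 as p approaches 0 or 1.
    return 1.0 - 2.0 * abs(p - 0.5)


def rank_by_uncertainty(
    scored: List[Tuple[float, str]]
) -> List[Tuple[float, str]]:
    """Sort (score, example) pairs so the most uncertain come first."""
    return sorted(
        scored, key=lambda item: uncertainty(smooth(item[0])), reverse=True
    )


# Raw scores of 0.01 and 0.001 differ by a factor of ten, but after
# smoothing they end up close together, so they are treated almost alike.
gap = abs(smooth(0.01) - smooth(0.001))

queue = rank_by_uncertainty([(0.95, "a"), (0.5, "b"), (0.01, "c")])
```

Under these assumptions, a rare-class candidate scored at 0.001 is ranked nearly the same as one scored at 0.01, which is why the skew in the label distribution does not translate into strong upsampling of the rare label.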

Arul · November 12, 2018, 8:21pm

Thank you for the detailed explanation
