In the training data, the distribution of labels is very skewed. One of the labels occurs about 50% of the time, and the rarest goes down to 0.02%, which is hardly anything. I am going to use ner.teach to improve the current model. Does the suggestion of labels take the label distribution into account? Does it suggest examples for the 0.02% label first (taking the uncertainty into account as well)?
I would like to find a way to even out the spread while running ner.teach. That said, the labels are not evenly spread in the corpus and domain themselves either, though they are not this skewed.
If you can find a way to not need the very skewed label distribution, that will likely make your problem a lot easier to annotate and learn. For instance, is there a more common category, where you can use that label and a terminology list to identify your rare category?
If you must have the entity recogniser work on the rare category, you’ll probably be best off creating a custom recipe, with logic that uses some sort of information-retrieval approach to give you a reasonable number of candidates to annotate for that class.
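As a rough illustration of the information-retrieval idea, here is a minimal sketch that filters a stream of texts down to those containing a term from a list and pre-marks the matches as candidate spans. The function name `find_candidates`, the example label, and the term list are all hypothetical; the `text`/`spans` dictionary layout follows the usual annotation task format, and wiring this into an actual custom recipe is left out.

```python
import re
from typing import Dict, Iterable, Iterator, List


def find_candidates(texts: Iterable[str], terms: List[str], label: str) -> Iterator[Dict]:
    """Yield only the texts that mention a term, with the matches marked as spans.

    Hypothetical helper: a plain keyword lookup standing in for whatever
    retrieval method suits your data.
    """
    if not terms:
        return
    # Longest terms first so multi-word terms win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in pattern.finditer(text)
        ]
        if spans:  # skip texts with no candidate for the rare class
            yield {"text": text, "spans": spans}


if __name__ == "__main__":
    docs = [
        "The patient took warfarin daily.",
        "Nothing relevant here.",
        "Heparin and warfarin interact.",
    ]
    for task in find_candidates(docs, ["warfarin", "heparin"], "ANTICOAGULANT"):
        print(task)
```

The point of the filter is only to raise the share of rare-class candidates in the queue; you would still accept or reject each one by hand.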
In theory the uncertainty sampling would prefer entities of the rare class, upsampling it in the annotation queue. In practice however, we don’t want to assume the probabilities produced by the model are too well calibrated — after all, the model’s accuracy might not be high during training. We therefore smooth the model’s probabilities, so there’s not really much difference in how we handle a score of 0.01 and a score of 0.001.
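To make the effect of smoothing concrete, here is a small sketch. The smoothing formula and the `alpha` value are assumptions chosen for illustration, not the formula the library actually uses; `smooth` and `uncertainty` are hypothetical helper names.

```python
def smooth(p: float, alpha: float = 0.1) -> float:
    """Pull a probability toward 0.5 (assumed additive smoothing, for illustration only)."""
    return (p + alpha) / (1.0 + 2.0 * alpha)


def uncertainty(p: float) -> float:
    """Score in [0, 1]: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(p - 0.5) * 2.0


if __name__ == "__main__":
    for p in (0.01, 0.001):
        print(p, uncertainty(p), uncertainty(smooth(p)))
```

Without smoothing, the two scores differ in uncertainty by roughly a factor of ten; after smoothing they end up close together, so the sampler treats them almost the same.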
Thank you for the detailed explanation.