Best way to annotate rare labels for classification

usage
textcat

(🍁 Tal Weiss) #1

I’m trying to classify a label that is rare in my text data: roughly 1 in 1,000 sentences.
I have good patterns which have about 10% precision but over 90% recall.
I can’t get textcat.teach to show me these samples. I’ve tried different setups, including a custom recipe that sorts the stream using prefer_high_scores, prefer_low_scores and prefer_uncertain with ‘probability’ / ‘ema’. In all of these scenarios Prodigy rarely shows me input sentences that match my patterns: only about 1 in 100 inputs, which looks almost random, so I give up. I tried the small and large English models as well as a blank model.
Help please?


(kyle) #2

Hey,

I think I was facing a similar issue when trying to classify insults in German. In the end I used the seed list to manually identify examples which were likely to belong to my positive (insult) class. After annotating these I mixed in some negative examples as well. I’ve really noticed how important it is, especially at the beginning, that the class distribution is an even 50/50. Once you have an initial dataset with equal class distributions, you can batch-train a model and use that in future annotation sessions.
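A minimal sketch of the balancing step, assuming the annotations are a list of dicts with an "answer" key set to "accept" or "reject" (the format Prodigy stores). The helper name `balance_binary` and the toy data are hypothetical, purely for illustration:

```python
import random


def balance_binary(examples, seed=0):
    """Downsample the larger class so accepts and rejects are equal in number."""
    rng = random.Random(seed)
    accepted = [eg for eg in examples if eg.get("answer") == "accept"]
    rejected = [eg for eg in examples if eg.get("answer") == "reject"]
    # Keep as many examples per class as the smaller class can provide.
    n = min(len(accepted), len(rejected))
    balanced = rng.sample(accepted, n) + rng.sample(rejected, n)
    rng.shuffle(balanced)
    return balanced


# Toy data: 5 positives and 50 negatives (made up for this example).
annotated = (
    [{"text": f"insult {i}", "answer": "accept"} for i in range(5)]
    + [{"text": f"neutral {i}", "answer": "reject"} for i in range(50)]
)
balanced = balance_binary(annotated)
```

Examples answered with anything other than "accept" or "reject" (for instance "ignore") are dropped here, which is a choice of this sketch rather than a requirement.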

HTH


(Matthew Honnibal) #3

Well…1/1000 is extremely rare. I think you’ll do best making a custom recipe with your own heuristics to cue up data for initial annotation.

The problem is, we can’t really place high confidence on the model’s probability judgments at the start of the active learning. If the model is assigning 0.001% probability of a given class, we don’t immediately know whether that’s because the model is highly miscalibrated, or because that’s the actual class probability.

In your case, you have extra information that isn’t available to the normal model, so I think you should be able to write a function that does better than the built-ins at queueing up your data.
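One way such a function could look, as a rough sketch rather than a built-in: a plain Python generator that only lets pattern-matched sentences through, optionally mixing in a small share of non-matching ones. With patterns at about 10% precision, roughly one in ten queued sentences should be a true positive instead of one in a thousand. The name `queue_candidates`, the `negative_rate` parameter and the example pattern are all hypothetical; the only assumption about the data is that each task is a dict with a "text" key.

```python
import random
import re


def queue_candidates(stream, patterns, negative_rate=0.0, seed=0):
    """Yield tasks whose text matches any pattern, plus a random share of the rest."""
    rng = random.Random(seed)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    for eg in stream:
        matched = any(rx.search(eg["text"]) for rx in compiled)
        if matched or rng.random() < negative_rate:
            # Record why the task was queued without clobbering existing meta.
            meta = dict(eg.get("meta", {}))
            meta["pattern_match"] = matched
            yield {**eg, "meta": meta}


# Toy stream and pattern (made up for this example).
stream = [
    {"text": "Refund my order please"},
    {"text": "Nice weather today"},
    {"text": "I want a REFUND now"},
]
queued = list(queue_candidates(stream, [r"\brefund\b"]))
```

In a custom recipe, a generator like this would be what you return as the "stream". Setting `negative_rate` above zero keeps a few sentences the patterns miss in the queue, so the roughly 10% of positives outside the patterns' recall still have a chance of being annotated.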