Best way to annotate rare labels for classification


(🍁 Tal Weiss) #1

My text data has a label I’m trying to classify that is rare: it applies to roughly 1 in 1000 sentences.
I have good patterns which have about 10% precision but over 90% recall.
I can’t get textcat.teach to show me these samples. I’ve tried different setups, including a custom recipe sorting the stream using prefer_high_scores, prefer_low_scores and prefer_uncertain with ‘probability’ / ‘ema’. In all these scenarios Prodigy does not show me input sentences that match my patterns (well - it does, but only about 1 in 100 inputs, which almost looks random, and I give up). I tried both the small and large English models as well as a blank model.
Help please?

(kyle) #2


I think I was facing a similar issue when trying to classify insults in German. In the end I used the seed list to manually identify examples which were likely to belong to my positive (insult) class. After annotating these, I mixed in some negative examples as well. I’ve really noticed how important it is, especially at the beginning, that the class distribution is an even 50/50. Once you have an initial dataset with an equal distribution, you can batch-train a model and use that in future annotation sessions.


(Matthew Honnibal) #3

Well…1/1000 is extremely rare. I think you’ll do best making a custom recipe with your own heuristics to cue up data for initial annotation.

The problem is, we can’t really place high confidence on the model’s probability judgments at the start of the active learning. If the model is assigning 0.001% probability of a given class, we don’t immediately know whether that’s because the model is highly miscalibrated, or because that’s the actual class probability.

In your case, you have different information, that’s not available to the normal model — so I think you should be able to write a function that does better than the built-ins at queueing up your data.
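One way to read this suggestion is a small pre-filter that sits in front of the annotation queue. The following is only a minimal sketch in plain Python, not a Prodigy recipe: the function name `queue_by_heuristic`, the regex patterns, the `negative_rate` parameter and the `pattern_match` meta key are all made up for illustration. It assumes each task is a dict with a `"text"` key.

```python
import random
import re


def queue_by_heuristic(stream, patterns, negative_rate=0.1, seed=0):
    """Yield every task whose text matches a pattern, plus a random
    sample of non-matching tasks so that negatives are also annotated.

    All names here are hypothetical; this is not part of Prodigy.
    """
    rng = random.Random(seed)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    for task in stream:
        if any(rx.search(task["text"]) for rx in compiled):
            # Copy the task and record why it was queued.
            meta = {**task.get("meta", {}), "pattern_match": True}
            yield {**task, "meta": meta}
        elif rng.random() < negative_rate:
            yield task


if __name__ == "__main__":
    examples = [{"text": "You absolute idiot"}, {"text": "Nice weather today"}]
    for task in queue_by_heuristic(examples, [r"\bidiot\b"], negative_rate=0.0):
        print(task["text"])
```

With patterns at roughly 10% precision and a 1/1000 base rate, the value of `negative_rate` controls how many unmatched sentences are mixed in; the right value depends on the data and is not something this sketch can decide.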

(🍁 Tal Weiss) #4

Thanks for the reply! I didn’t think my problem was that uncommon.

Is there a way to bias combine_models()? I couldn’t find the source code for it and a simple PatternMatcher does find my examples (though, of course, nothing else but them). The function is not documented.

Also, where is the actual active learning being done? Is it in the sorters (e.g. prefer_uncertain() )? Is there an example of writing a custom sorter?
I wrote this to figure out the API…

def my_sorter(stream_of_score_example, bias):
    for score, example in stream_of_score_example:
        if score > bias:
            yield example

(Ines Montani) #5

Ah, sorry about that. Here’s the source for combine_models, it’s pretty straightforward:

from toolz.itertoolz import interleave, partition_all

def combine_models(one, two, batch_size=32):
    """Combine two models and return a predict and update function. Predictions
    of both models are combined using the toolz.interleave function. Mostly
    used to combine an EntityRecognizer with a PatternMatcher.
    one (callable): First model. Requires a `__call__` and `update` method.
    two (callable): Second model. Requires a `__call__` and `update` method.
    batch_size (int): The batch size to use for predicting the stream.
    RETURNS (tuple): A `(predict, update)` tuple of the respective functions.
    """

    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            stream1 = one(batch)
            stream2 = two(batch)
            yield from interleave((stream1, stream2))

    def update(examples):
        loss = one.update(examples) + two.update(examples)
        return loss

    return predict, update

Yep, your sorter function is pretty much exactly how it should be.

Sorters are functions that take a stream of (score, example) tuples (as produced by Prodigy’s built in models) and yield examples. The built-in sorters like prefer_uncertain and prefer_high_scores use an exponential moving average to decide whether to yield out an example or not. This prevents it from getting stuck if there’s no even distribution of scores.
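To illustrate why a moving average helps, here is a rough sketch of the idea. It is not Prodigy's actual implementation (that source is not shown in this thread); the function name `ema_prefer_high`, the smoothing factor `alpha` and the starting value are assumptions chosen for the example. The threshold follows the scores the sorter has seen, so a stream in which every score is tiny still lets its relatively highest examples through.

```python
def ema_prefer_high(scored_stream, alpha=0.1, start=0.5):
    """Yield examples whose score is at least the running exponential
    moving average of all scores seen so far.

    Illustrative only: names and constants are hypothetical.
    """
    ema = start
    for score, example in scored_stream:
        if score >= ema:
            yield example
        # Move the threshold towards the score that was just observed.
        ema = alpha * score + (1 - alpha) * ema


if __name__ == "__main__":
    # A miscalibrated model: every score is far below 0.5.
    scores = [0.01, 0.02] * 100
    scored = [(s, {"text": f"example {i}", "score": s}) for i, s in enumerate(scores)]

    fixed = [eg for s, eg in scored if s >= 0.5]
    adaptive = list(ema_prefer_high(scored))
    print(len(fixed), len(adaptive))
```

A fixed cut-off of 0.5 yields nothing on this stream, while the moving average drifts down until the relatively higher-scoring examples pass it.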

But you can also incorporate your own logic that’s more specific. For example, you could check how many examples are already annotated and use that to calibrate the bias. You could also check for custom metadata you’ve added to example["meta"] or other properties (maybe you want to prioritise examples with longer text over examples with shorter text, or something like that).
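As a minimal sketch of that idea, the sorter below always passes examples flagged in `example["meta"]` and only starts trusting the model score once enough annotations exist. The names `calibrated_sorter`, `n_annotated`, `warmup` and the `pattern_match` meta key are hypothetical; how the annotation count is obtained is left to the caller and is not a Prodigy API.

```python
def calibrated_sorter(scored_stream, n_annotated, warmup=200, bias=0.5):
    """Custom sorter: takes (score, example) tuples and yields examples.

    n_annotated is a caller-supplied function returning how many examples
    have been annotated so far (hypothetical, not part of Prodigy).
    """
    for score, example in scored_stream:
        meta = example.get("meta", {})
        if meta.get("pattern_match"):
            # Heuristic hits are always worth showing, whatever the score.
            yield example
        elif n_annotated() >= warmup and score > bias:
            # After the warm-up phase, also trust the model's score.
            yield example


if __name__ == "__main__":
    scored = [
        (0.9, {"text": "model is confident"}),
        (0.1, {"text": "pattern hit", "meta": {"pattern_match": True}}),
        (0.2, {"text": "neither"}),
    ]
    print([eg["text"] for eg in calibrated_sorter(scored, lambda: 0)])
    print([eg["text"] for eg in calibrated_sorter(scored, lambda: 500)])
```

The warm-up length and the bias are tuning knobs; the values above are placeholders, not recommendations.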

(🍁 Tal Weiss) #6

I might be doing something wrong, but my PatternMatcher only emits tasks, which are pattern matches, which are rare… This means that combine_models.predict() emits only the TextClassifier outputs (most of the time)!
Is the PatternMatcher supposed to return non-matching “input sentences” as tasks with a lower score?

(🍁 Tal Weiss) #7

Assuming PatternMatcher is working correctly, I ended up changing combine_models to return this predict function:

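import itertools  # needed at module level for the chained zip below

def predict(stream):
    for batch in partition_all(batch_size, stream):
        batch = list(batch)
        stream1 = one(batch)
        stream2 = two(batch)
        # zip stops at the shorter stream, so both models contribute equally
        yield from itertools.chain.from_iterable(zip(stream1, stream2))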

It still interleaves the streams, but stops as soon as the shorter one runs out, so each batch contributes the same number of tasks from both models (balanced data).
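The difference between the two strategies can be seen with plain lists. In this sketch, `interleave_all` is a hand-written stand-in that approximates `toolz.interleave` for two streams (so the example has no third-party dependency), and `interleave_shortest` is the chained-zip variant from the post above. The task names are placeholders.

```python
from itertools import chain, zip_longest

_MISSING = object()


def interleave_all(a, b):
    """Alternate between a and b until BOTH are exhausted
    (stand-in for toolz.interleave on two streams)."""
    for pair in zip_longest(a, b, fillvalue=_MISSING):
        for item in pair:
            if item is not _MISSING:
                yield item


def interleave_shortest(a, b):
    """Alternate between a and b, stopping when the shorter one ends."""
    return chain.from_iterable(zip(a, b))


if __name__ == "__main__":
    model_tasks = ["m1", "m2", "m3", "m4"]  # classifier scores every input
    matcher_tasks = ["p1"]                  # matcher only emits rare hits

    print(list(interleave_all(model_tasks, matcher_tasks)))
    # ['m1', 'p1', 'm2', 'm3', 'm4']
    print(list(interleave_shortest(model_tasks, matcher_tasks)))
    # ['m1', 'p1']
```

One consequence worth keeping in mind: with the zip variant, a batch in which the matcher finds nothing emits no tasks at all, and any model tasks beyond the matcher's count for that batch are dropped.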

(🍁 Tal Weiss) #8

One more sub-question: if PatternMatcher matches more than 1 pattern in an input, it emits more than 1 task. But that might contaminate the annotations for a text classifier, over-fitting the input with itself.
Is there a way to limit the PatternMatcher output to only 1 task per input?

(Ines Montani) #9

The pattern matcher will only yield out matches – all of them. It also attaches a score to the tasks, so if you’re using a sorter, that score will also be used to determine whether to send it out or not. So if you’re not seeing a match, that’s usually why.

Btw, the PatternMatcher is really mostly a wrapper around spaCy’s Matcher and PhraseMatcher. So you can always build your own if you want more control over the matches.

Each task has an _input_hash (describing the original input, e.g. the text) and a _task_hash, describing the specific task. So you can use those to ensure that the same input is only presented to the annotator once. (Or maybe you do want to see it twice if it’s produced by both the model and the matcher, and then filter it out later. That really depends on your workflow.)
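A minimal sketch of the "present each input only once" option could look like the filter below. It assumes every task in the stream already carries an `_input_hash`, as described above; the function name `one_task_per_input` is made up for this example, and where exactly such a filter belongs in a recipe depends on the workflow.

```python
def one_task_per_input(stream):
    """Yield only the first task seen for each _input_hash.

    Hypothetical helper: assumes tasks are dicts that already
    contain an "_input_hash" key.
    """
    seen = set()
    for task in stream:
        key = task["_input_hash"]
        if key in seen:
            continue  # same input already queued, e.g. a second pattern match
        seen.add(key)
        yield task


if __name__ == "__main__":
    tasks = [
        {"text": "a b", "_input_hash": 1, "_task_hash": 10},
        {"text": "a b", "_input_hash": 1, "_task_hash": 11},
        {"text": "c", "_input_hash": 2, "_task_hash": 20},
    ]
    print([t["_task_hash"] for t in one_task_per_input(tasks)])
    # [10, 20]
```

Note that the set of seen hashes grows with the stream, which is fine for typical annotation sessions but worth considering for very large inputs.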