Default active learning

I am continually getting entities that the model is predicting with very low probability.
I believe some of the labels I am trying to classify occur relatively infrequently in my corpus.

I am wondering about the default behavior for ner.teach. Specifically, how is it choosing the “most uncertain” span? Does it do it for every example in a batch? For the top X most uncertain in each batch?

How can I hack it so that I only receive candidate labels when the model's prediction falls within a given range of probabilities? Or do you have any other suggestions for getting around this issue?

Thanks!

The sorting in ner.teach is a little bit more intricate than in the other teach recipes, because within a batch, we try to group questions about the same entity. A byproduct of this sorting is that you often get small runs of accepts and rejects. I find this helps me click through faster: if I’m getting similar questions together and they’re all unlikely, I can do three or four per second.

The entity grouping happens within the EntityRecognizer.__call__() function. On top of this, the ner.teach recipe also calls into prefer_uncertain, which sorts the stream to prefer questions closer to 0.5 in probability.
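To make "closer to 0.5" concrete, here is a rough illustration. It is a simplified stand-in and not the actual prefer_uncertain code, which works on a stream rather than on a finished list. The function name sort_batch_by_uncertainty and the toy batch are made up for the example.

```python
def sort_batch_by_uncertainty(scored_batch):
    """Order (score, example) pairs so that scores nearest 0.5 come first.

    Simplified illustration only: this sorts a finished list, whereas the
    real sorter operates on a stream.
    """
    return sorted(scored_batch, key=lambda pair: abs(pair[0] - 0.5))


# Toy batch of (score, example) pairs, invented for this example.
batch = [
    (0.02, {"text": "a"}),
    (0.45, {"text": "b"}),
    (0.90, {"text": "c"}),
]
ordered = sort_batch_by_uncertainty(batch)
ordered_texts = [eg["text"] for _, eg in ordered]
```

Here the 0.45 example comes first and the 0.02 example last, which is why very unlikely entities only show up when nothing more uncertain is available.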

Applying a threshold to the scored stream is very easy. You can just have a stream filter function like this:


def filter_unlikely(scored_stream, min_score):
    # Keep only the (score, example) pairs that reach the minimum score.
    for score, eg in scored_stream:
        if score >= min_score:
            yield score, eg

Usage would look like this:


scored_stream = model(stream)
scored_stream = filter_unlikely(scored_stream, 0.1)
stream = prefer_uncertain(scored_stream)
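Since the question asked about a range of probabilities rather than just a lower bound, the same idea extends to two cut-offs. This is a minimal sketch: filter_score_range, min_score and max_score are names chosen for the example, and it assumes the scored stream yields (score, example) tuples as above.

```python
def filter_score_range(scored_stream, min_score=0.1, max_score=0.9):
    """Yield only (score, example) pairs with min_score <= score <= max_score."""
    for score, eg in scored_stream:
        if min_score <= score <= max_score:
            yield score, eg


# Toy scored stream, invented for this example.
scored = [
    (0.02, {"text": "a"}),
    (0.45, {"text": "b"}),
    (0.97, {"text": "c"}),
]
kept = list(filter_score_range(scored, min_score=0.1, max_score=0.9))
```

Only the 0.45 example survives here. The filtered stream can then be passed to prefer_uncertain in the same way as in the usage snippet.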

Some care needs to be taken when applying a minimum threshold to the stream, however. If you have a cut-off like this, the model can get into a state where you’re asked no questions. If the model learns to assign >0.99 probability to an analysis that has no entities, you’ll get no questions if you have a minimum threshold of 0.01. So, you can get stuck.

Without a minimum threshold, the model will continue asking you questions about the most likely entities, even if the scores assigned to them are very small.
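If you want a threshold but are worried about getting stuck, one option is to let an example through whenever too many in a row have been filtered out. This is only a sketch of that idea, not something built into the recipe: filter_with_fallback and max_skipped are hypothetical names, and a sensible value for max_skipped would depend on your data.

```python
def filter_with_fallback(scored_stream, min_score=0.1, max_skipped=50):
    """Apply a minimum score, but never skip more than max_skipped in a row.

    After max_skipped consecutive examples have fallen below min_score, the
    next example is yielded regardless of its score, so the annotator keeps
    receiving questions.
    """
    skipped = 0
    for score, eg in scored_stream:
        if score >= min_score or skipped >= max_skipped:
            skipped = 0
            yield score, eg
        else:
            skipped += 1


# Toy stream in which every score is below the threshold.
low_scores = [(0.001, {"id": i}) for i in range(6)]
emitted = list(filter_with_fallback(low_scores, min_score=0.1, max_skipped=2))
emitted_ids = [eg["id"] for _, eg in emitted]
```

With max_skipped=2, every third example is passed through even though none reaches the threshold, so the model keeps getting feedback and has a chance to move away from the "no entities" analysis.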