Best way to annotate rare labels for classification

Ah, sorry about that. Here's the source for combine_models, it's pretty straightforward:

from toolz.itertoolz import interleave, partition_all

def combine_models(one, two, batch_size=32):
    """Combine two models and return a predict and update function. Predictions
    of both models are combined using the toolz.interleave function. Mostly
    used to combine an EntityRecognizer with a PatternMatcher.
    one (callable): First model. Requires a `__call__` and `update` method.
    two (callable): Second model. Requires a `__call__` and `update` method.
    batch_size (int): The batch size to use for predicting the stream.
    RETURNS (tuple): A `(predict, update)` tuple of the respective functions.
    """

    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            stream1 = one(batch)
            stream2 = two(batch)
            yield from interleave((stream1, stream2))

    def update(examples):
        loss = one.update(examples) + two.update(examples)
        return loss

    return predict, update
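To show how the combined `(predict, update)` pair behaves, here's a runnable sketch. `ToyModel` and the stdlib re-implementations of `partition_all` and `interleave` are just stand-ins so the example runs without `toolz` or Prodigy installed; the real code should use `toolz.itertoolz` as imported above.

```python
from itertools import islice

# Stdlib stand-ins for toolz.partition_all and toolz.interleave.
def partition_all(n, seq):
    it = iter(seq)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

def interleave(seqs):
    # Round-robin over the iterators until all are exhausted.
    iters = [iter(s) for s in seqs]
    while iters:
        alive = []
        for it in iters:
            try:
                item = next(it)
            except StopIteration:
                continue
            yield item
            alive.append(it)
        iters = alive

def combine_models(one, two, batch_size=32):
    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            yield from interleave((one(batch), two(batch)))

    def update(examples):
        return one.update(examples) + two.update(examples)

    return predict, update

# Toy models standing in for an EntityRecognizer and a PatternMatcher:
# callable on a batch of texts, with an `update` method returning a loss.
class ToyModel:
    def __init__(self, name, score):
        self.name = name
        self.score = score

    def __call__(self, batch):
        return ((self.score, {"text": t, "model": self.name}) for t in batch)

    def update(self, examples):
        return 0.0

predict, update = combine_models(ToyModel("ner", 0.9), ToyModel("matcher", 0.4),
                                 batch_size=2)
scored = list(predict(["a", "b", "c"]))
# Predictions alternate between the two models within each batch.
```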

Yep, your sorter function is pretty much exactly how it should be.

Sorters are functions that take a stream of (score, example) tuples (as produced by Prodigy's built-in models) and yield examples. The built-in sorters like prefer_uncertain and prefer_high_scores use an exponential moving average to decide whether to yield an example or not. This prevents the sorter from getting stuck if the scores aren't evenly distributed.
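The EMA idea can be sketched like this. Note this is an illustration of the technique, not Prodigy's actual internals; the `smoothing` value and the uncertainty formula are made up for the example.

```python
def prefer_uncertain_sketch(scored_stream, smoothing=0.9):
    """Yield examples whose uncertainty beats a running average.

    Uncertainty is highest for scores near 0.5 and lowest near 0 or 1.
    The exponential moving average adapts to the stream, so the sorter
    keeps yielding even if all scores are skewed high or low.
    """
    ema = 0.5
    for score, example in scored_stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2  # 1.0 at 0.5, 0.0 at extremes
        ema = smoothing * ema + (1 - smoothing) * uncertainty
        if uncertainty >= ema:
            yield example

stream = [(0.5, {"text": "a"}), (0.99, {"text": "b"}), (0.5, {"text": "c"})]
picked = list(prefer_uncertain_sketch(stream))
# The confident prediction (0.99) is skipped; the uncertain ones are kept.
```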

But you can also incorporate your own, more specific logic. For example, you could check how many examples have already been annotated and use that to calibrate the bias. You could also check for custom metadata you've added to example["meta"] or other properties (maybe you want to prioritise examples with longer text over examples with shorter text, or something like that).
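A custom sorter along those lines might look like this. The thresholds (`min_chars`, `high_score`) are invented for the example; the only contract is that the function consumes (score, example) tuples and yields examples.

```python
def prefer_long_texts(scored_stream, min_chars=50, high_score=0.8):
    """Yield long texts right away; only yield short texts if the model
    is confident about them. Thresholds are illustrative, not built in."""
    for score, example in scored_stream:
        text = example.get("text", "")
        if len(text) >= min_chars or score >= high_score:
            yield example

stream = [
    (0.3, {"text": "short"}),
    (0.9, {"text": "short but confident"}),
    (0.2, {"text": "a much longer example text that easily clears the fifty character bar"}),
]
kept = list(prefer_long_texts(stream))
# "short" is dropped: it's under min_chars and its score is low.
```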
