prefer_high_scores(stream) not prioritising the bootstrapped texts during cold start


I am working on single-label multi-class text classification. As I don't want to annotate the very same labels again and again (and I am a newb so I can't write a stream filtering func for both task hash and keys either), I code each of my labels separately, into separate dbs (as recommended) that I later plan to merge together.

Since I want to gather the relevant positive examples as fast as possible, I use prefer_high_scores(predict(stream)). The expected behaviour: the first examples to annotate would be those with a high score and/or those scored 0.5 because they match the pre-defined patterns I used for bootstrapping. Instead, the examples with a high predicted score (0.7) do come first, but the ones matched by the patterns only trickle in slowly over each session.

Did I somehow manage to break my stream?

I use these two overriding functions:

# highlight spans passed as _spans with the texts
def overwrite_spans(stream):
    for eg in stream:
        eg["spans"] = eg["_spans"]
        yield eg

# show each text only once
def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            # Remember the hash so later duplicates are skipped
            seen.add(input_hash)
            yield eg

stream = prefer_high_scores(predict(stream))
stream = filter_stream(stream)
stream = overwrite_spans(stream)

I have noticed that after collecting 650 examples and restarting textcat.teach, prefer_high_scores(predict(stream)) works as expected.

Hi again,

so after another week of experimenting, I still cannot get the desired behaviour, namely getting the prefer_high_scores sorter to present cases identified by the pattern matcher before those with scores around 0.

My hunch is that the sorter makes its selection from a fixed-size sample of examples, and if none of them match the patterns, it simply presents what it has.

How can I get the sorter to scan over a bigger window (or even the whole dataset) and present the texts matching the patterns first?

Thank you in advance!!

Yeah, I think what you're observing here is that there are no pattern matches in the stream, only some initial suggestions from the model. Because the sorters operate over a (potentially infinite) stream and only get to see batches, they use an exponential moving average to decide what to send out. The main objectives here are: make sure there's always something to annotate and make sure the stream never gets stuck because the scores are too high/low.
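To make the moving-average idea a bit more concrete, here is a toy sketch. It is not Prodigy's actual implementation: the function name toy_prefer_high_scores, the alpha and initial parameters, and the simple "score at or above the running average" rule are all assumptions made for illustration. It only shows why a stream with nothing but low scores still ends up emitting something.

```python
def toy_prefer_high_scores(scored_stream, alpha=0.5, initial=0.5):
    """Toy sorter over (score, example) tuples.

    Assumption: an example is sent out when its score is at or above an
    exponential moving average of the scores seen so far. This is an
    illustration only, not how Prodigy's sorter is actually written.
    """
    avg = initial
    for score, eg in scored_stream:
        if score >= avg:
            yield eg
        # Move the average towards the latest score, so a long run of
        # low scores lowers the bar instead of blocking the stream.
        avg = alpha * score + (1 - alpha) * avg


scored = [
    (0.1, {"text": "a"}),
    (0.1, {"text": "b"}),
    (0.1, {"text": "c"}),
    (0.2, {"text": "d"}),
    (0.7, {"text": "e"}),
]
emitted = [eg["text"] for eg in toy_prefer_high_scores(scored)]
print(emitted)
```

In this toy run the first three examples are skipped while the average drops, and then even the 0.2 example gets through, because it is high relative to what the sorter has seen recently.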

One option would be to use a larger batch_size. I'm not sure you'd want to be operating on the whole stream, because you typically want the updates to the model to be reflected in the scores. If you just score the whole thing up front, you won't get to see the updated scores.

You could also write your own logic that first finds all pattern matches in the stream and yields them out. Or sends out the first N pattern matches of the whole stream. Next, it iterates over the stream again and sends out examples suggested by the model only. Streams are regular Python generators, so you're pretty flexible in terms of how you set them up.
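A minimal sketch of that idea, using plain Python only. The names patterns_first, is_pattern_match and n_matches are hypothetical, and how you detect a pattern match on an example depends on how your matcher marks them, so the predicate is left to you. It also assumes the examples fit in memory, since the stream is read into a list in order to go over it twice.

```python
def patterns_first(examples, is_pattern_match, n_matches=None):
    """Yield pattern-matched examples first, then everything else.

    examples: any iterable of task dicts (read fully into memory here).
    is_pattern_match: hypothetical callable returning True for examples
        that were matched by a pattern.
    n_matches: optional cap on how many matches to send out first.
    """
    examples = list(examples)
    matched = [i for i, eg in enumerate(examples) if is_pattern_match(eg)]
    first = set(matched[:n_matches])  # a slice with None keeps all matches
    # First pass: send out the (first N) pattern matches in stream order
    for i in sorted(first):
        yield examples[i]
    # Second pass: send out everything that was not already sent
    for i, eg in enumerate(examples):
        if i not in first:
            yield eg


examples = [
    {"text": "a"},
    {"text": "b match"},
    {"text": "c"},
    {"text": "d match"},
]
ordered = [eg["text"] for eg in patterns_first(examples, lambda eg: "match" in eg["text"])]
print(ordered)
```

If you want the second pass to still be driven by the model, you could split this into two generators and wrap only the remaining examples in prefer_high_scores(predict(...)), so that those scores keep reflecting model updates.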
