prefer_high_scores(stream) not prioritising the bootstrapped texts during cold start

Yeah, I think what you're observing here is that there are no pattern matches in the stream, only some initial suggestions from the model. Because the sorters operate over a (potentially infinite) stream and only ever see one batch at a time, they use an exponential moving average of the scores to decide what to send out. The main objectives here are: make sure there's always something to annotate, and make sure the stream never gets stuck because the scores are too high or too low.
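To make that concrete, here's a minimal sketch of how a batch-wise sorter with an exponential moving average could work. This is an illustration of the idea, not the library's actual implementation: the function name, the smoothing constant `alpha`, and the neutral starting value are all assumptions.

```python
def prefer_high_scores_sketch(stream, alpha=0.1, batch_size=10):
    """Yield examples whose score beats an exponential moving average.

    ``stream`` yields (score, example) tuples. Because we only ever look
    at one batch at a time, the threshold adapts as the model updates,
    and it can never drift so far that nothing gets emitted.
    """
    ema = 0.5  # start from a neutral prior (assumption)
    batch = []

    def flush(batch, ema):
        for score, example in batch:
            if score >= ema:  # above the running average: send it out
                yield example
            # update the moving average with every score we've seen
            ema = alpha * score + (1 - alpha) * ema
        batch.clear()
        # hand the updated average back via an attribute-free trick:
        yield ("__ema__", ema)

    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            for out in flush(batch, ema):
                if isinstance(out, tuple) and out[0] == "__ema__":
                    ema = out[1]
                else:
                    yield out
    # flush whatever is left at the end of a finite stream
    for out in flush(batch, ema):
        if isinstance(out, tuple) and out[0] == "__ema__":
            ema = out[1]
        else:
            yield out
```

With a batch size of 2 and scores `[0.9, 0.1, 0.8, 0.2]`, the first and third examples beat the running average and are emitted, while the low-scoring ones are held back.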

One option would be to use a larger batch_size. I'm not sure you'd want to operate on the whole stream, though, because you typically want updates to the model to be reflected in the scores. If you just score the whole thing up front, you won't see the updated scores.

You could also write your own logic that first finds all pattern matches in the stream and yields those out, or only the first N pattern matches of the whole stream. It would then iterate over the stream again and send out the examples suggested by the model only. Streams are regular Python generators, so you're pretty flexible in how you set them up.
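A sketch of that two-pass setup, under a couple of stated assumptions: `get_stream` is a zero-argument function returning a fresh generator of example dicts (we need to iterate twice), and pattern matches are identified by a `"pattern"` key in each example's `"meta"` dict. Adjust that check to however your examples actually mark their origin.

```python
import itertools

def patterns_first(get_stream, n_patterns=None):
    """Yield pattern matches first, then model-only suggestions.

    ``get_stream``: callable returning a fresh generator of example
    dicts. We assume (for illustration) that pattern matches carry a
    ``"pattern"`` key in their ``"meta"`` dict.
    """
    pattern_matches = (
        eg for eg in get_stream() if "pattern" in eg.get("meta", {})
    )
    if n_patterns is not None:
        # only send out the first N pattern matches of the whole stream
        pattern_matches = itertools.islice(pattern_matches, n_patterns)
    yield from pattern_matches
    # second pass over a fresh generator: model-only suggestions
    yield from (
        eg for eg in get_stream() if "pattern" not in eg.get("meta", {})
    )
```

Since the result is itself a generator, you can wrap it in whatever sorter you were using before, or feed it to the annotation loop directly.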
