prefer_high_scores(stream) not prioritising the bootstrapped texts during cold start

Yeah, I think what you're observing here is that there are no pattern matches in the stream, only some initial suggestions from the model. Because the sorters operate over a (potentially infinite) stream and only ever see one batch at a time, they use an exponential moving average of the scores to decide what to send out. The main objectives here are: make sure there's always something to annotate, and make sure the stream never gets stuck because the scores are too high or too low.
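To make that concrete, here's a minimal sketch of how a batch-wise sorter with an exponential moving average could work. This is an illustration of the idea, not the library's actual implementation: the function name, the smoothing constant `alpha`, and the neutral starting value are all assumptions.

```python
def prefer_high_scores_sketch(stream, alpha=0.1, batch_size=10):
    """Yield examples whose score beats an exponential moving average.

    ``stream`` yields (score, example) tuples. Because we only ever look
    at one batch at a time, the threshold adapts as the model updates,
    and it can never drift so far that nothing gets emitted.
    """
    ema = 0.5  # start from a neutral prior (assumption)
    batch = []

    def flush(batch, ema):
        for score, example in batch:
            if score >= ema:  # above the running average: send it out
                yield example
            # update the moving average with every score we've seen
            ema = alpha * score + (1 - alpha) * ema
        batch.clear()
        # hand the updated average back via an attribute-free trick:
        yield ("__ema__", ema)

    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            for out in flush(batch, ema):
                if isinstance(out, tuple) and out[0] == "__ema__":
                    ema = out[1]
                else:
                    yield out
    # flush whatever is left at the end of a finite stream
    for out in flush(batch, ema):
        if isinstance(out, tuple) and out[0] == "__ema__":
            ema = out[1]
        else:
            yield out
```

With a batch size of 2 and scores `[0.9, 0.1, 0.8, 0.2]`, the first and third examples beat the running average and are emitted, while the low-scoring ones are held back.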

One option would be to use a larger batch_size. I'm not sure you'd want to operate on the whole stream, though, because you typically want updates to the model to be reflected in the scores. If you just score the whole thing up front, you won't see the updated scores.

You could also write your own logic that first finds all pattern matches in the stream and yields those out, or only the first N pattern matches of the whole stream. It would then iterate over the stream again and send out the examples suggested by the model only. Streams are regular Python generators, so you're pretty flexible in how you set them up.
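A sketch of that two-pass setup, under a couple of stated assumptions: `get_stream` is a zero-argument function returning a fresh generator of example dicts (we need to iterate twice), and pattern matches are identified by a `"pattern"` key in each example's `"meta"` dict. Adjust that check to however your examples actually mark their origin.

```python
import itertools

def patterns_first(get_stream, n_patterns=None):
    """Yield pattern matches first, then model-only suggestions.

    ``get_stream``: callable returning a fresh generator of example
    dicts. We assume (for illustration) that pattern matches carry a
    ``"pattern"`` key in their ``"meta"`` dict.
    """
    pattern_matches = (
        eg for eg in get_stream() if "pattern" in eg.get("meta", {})
    )
    if n_patterns is not None:
        # only send out the first N pattern matches of the whole stream
        pattern_matches = itertools.islice(pattern_matches, n_patterns)
    yield from pattern_matches
    # second pass over a fresh generator: model-only suggestions
    yield from (
        eg for eg in get_stream() if "pattern" not in eg.get("meta", {})
    )
```

Since the result is itself a generator, you can wrap it in whatever sorter you were using before, or feed it to the annotation loop directly.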
