prefer_high_scores(stream) not prioritising the bootstrapped texts during cold start

JanP · November 2, 2019, 4:23pm

Hello,

I am working on single-label multi-class text classification. As I don't want to annotate the very same labels again and again (and I am a newb so I can't write a stream filtering func for both task hash and keys either), I code each of my labels separately, into separate dbs (as recommended) that I later plan to merge together.

Since I want to gather the relevant positive examples as fast as possible, I use prefer_high_scores(predict(stream)). The behaviour I expected: the first examples to annotate would be those with a high predicted score and/or those scored 0.5 because they match the pre-defined patterns I used for bootstrapping. Instead, the examples with a high predicted score (0.7) do come first, but the ones scored by the patterns only trickle in slowly over each session.

Did I somehow manage to break my stream?

I use these two overriding functions:

# highlight spans passed as _spans with the texts
def overwrite_spans(stream):
    for eg in stream:
        eg["spans"] = eg["_spans"]
        yield eg

# show each text only once
def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
        seen.add(input_hash)

stream = prefer_high_scores(predict(stream))
stream = filter_stream(stream)
stream = overwrite_spans(stream)

JanP · November 2, 2019, 5:55pm

I have noticed that after collecting 650 examples and restarting the textcat.teach, prefer_high_scores(predict(stream)) works as expected.

JanP · November 8, 2019, 4:01pm

Hi again,

so after another week of experimenting, I still cannot get the expected/desired behaviour, namely getting the prefer_high_scores sorter to present the cases identified by the pattern matcher before those with scores around 0.

My hunch is that the sorter makes its selection from a fixed-size sample of examples, and if none of them match the patterns, it simply presents what it has.

How can I get the sorter to scan over a bigger window (or even the whole dataset) and present the texts matching the patterns first?

Thank you beforehand!!

ines · November 11, 2019, 12:04pm

Yeah, I think what you're observing here is that there are no pattern matches in the stream, only some initial suggestions from the model. Because the sorters operate over a (potentially infinite) stream and only get to see batches, they use an exponential moving average to decide what to send out. The main objectives here are: make sure there's always something to annotate and make sure the stream never gets stuck because the scores are too high/low.
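To illustrate the idea, here is a minimal sketch of a sorter driven by an exponential moving average. It is not Prodigy's actual implementation: the function name ema_high_score_filter and the alpha parameter are made up for this example, and it assumes the scored stream yields (score, example) tuples. It only shows why such a sorter cannot look ahead in the stream.

```python
def ema_high_score_filter(scored_stream, alpha=0.2):
    """Hypothetical sketch of an EMA-based sorter (not Prodigy's code).

    scored_stream: iterable of (score, example) tuples.
    alpha: assumed smoothing factor; higher values adapt faster.
    """
    threshold = 0.0  # starts at 0 so the first example is always sent out
    for score, example in scored_stream:
        # Send the example if it scores at least as high as the running average
        if score >= threshold:
            yield example
        # Move the average towards the score that was just seen
        threshold = (1 - alpha) * threshold + alpha * score


scored = [(0.7, "a"), (0.1, "b"), (0.1, "c"), (0.5, "d")]
selected = list(ema_high_score_filter(scored))
```

Because the average drifts towards whatever scores arrive, a run of low scores lowers the bar over time. A pattern match that sits far down the stream is only considered once the sorter gets to it.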

One option would be to use a larger batch_size. I'm not sure you'd want to be operating on the whole stream, because you typically want the updates to the model to be reflected in the scores. If you just score the whole thing up front, you won't get to see the updated scores.

You could also write your own logic that first finds all pattern matches in the stream and yields them out. Or sends out the first N pattern matches of the whole stream. Next, it iterates over the stream again and sends out examples suggested by the model only. Streams are regular Python generators, so you're pretty flexible in terms of how you set them up.
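A rough sketch of that two-pass approach could look like the following. The names patterns_first, is_pattern_match, suggest_with_model and max_matches are hypothetical, not part of Prodigy. The sketch also assumes the dataset is small enough to load into memory so it can be iterated over twice.

```python
def patterns_first(examples, is_pattern_match, suggest_with_model, max_matches=None):
    """Hypothetical two-pass stream: pattern matches first, model suggestions after.

    is_pattern_match: callable taking an example and returning True or False.
    suggest_with_model: callable taking an iterable of examples and returning
        an iterable of examples chosen by the model.
    max_matches: optional cap on how many pattern matches to send out first.
    """
    examples = list(examples)  # assumption: the data fits in memory
    sent = set()

    # First pass: send out pattern matches from the whole dataset
    for index, eg in enumerate(examples):
        if max_matches is not None and len(sent) >= max_matches:
            break
        if is_pattern_match(eg):
            sent.add(index)
            yield eg

    # Second pass: everything not sent yet goes through the model, lazily,
    # so later model updates can still be reflected in the scores
    remaining = (eg for index, eg in enumerate(examples) if index not in sent)
    yield from suggest_with_model(remaining)


data = [{"text": "cheap flights"}, {"text": "hello"}, {"text": "flights to Oslo"}]
has_flights = lambda eg: "flights" in eg["text"]
identity = lambda stream: stream  # stand-in for the model step

ordered = [eg["text"] for eg in patterns_first(data, has_flights, identity)]
capped = [eg["text"] for eg in patterns_first(data, has_flights, identity, max_matches=1)]
```

In the setup from the first post, suggest_with_model could be something like lambda s: prefer_high_scores(predict(s)), with the filter_stream and overwrite_spans wrappers applied to the result as before.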
