Is it possible to combine prefer_high_scores and prefer_uncertain so that a mix of high-score and mid-score documents gets batched out?

I would like to batch out a combination of high score documents and mid-score documents to annotators (roughly 9:1 ratio). I am not sure how it can be done. Currently, my relevant setup is as follows:

nlp = spacy.load(MODEL_PATH, disable=["parser", "ner"])
stream = get_stream_filter(source, db, dataset)
stream = add_tokens(nlp, prefer_high_scores(model_score(stream, nlp)))

where the model_score function is defined as follows:

def model_score(stream, nlp):
    for eg in stream:
        yield (nlp(eg["text"]).cats["COMPLIMENT"], eg)

Any pointers would be greatly appreciated! Thanks.

Hi! I think for a use case like that, you probably just want to write your own sorter function. Under the hood, sorters like prefer_high_scores and prefer_uncertain are just functions that take a stream of (score, example) tuples and decide whether to yield an example. For instance:

def custom_sorter(scored_stream):
    for score, eg in scored_stream:
        # TODO: your conditional logic that decides whether to send 
        # out the example or not
        yield eg

Within that function, you can keep any state, like a counter of the high and uncertain scores you sent out previously, to make sure you keep the same ratio. Depending on your data, you might also want to include logic to ensure that you don't get stuck in a suboptimal state and stop sending out examples, for instance if your model somehow only ends up producing super low scores. (In the built-in sorters, Prodigy uses an exponential moving average.)
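To make that a bit more concrete, here is a rough sketch of what such a sorter could look like for a 9:1 split. The name mixed_sorter and all of its parameters (high_threshold, mid_range, high_per_mid, max_skipped) are made up for illustration and are not part of Prodigy, and the cutoffs are guesses you would need to tune for your model. Note that the "don't get stuck" logic here is a plain skip counter, which is a much simpler stand-in for the exponential moving average the built-in sorters use.

```python
def mixed_sorter(scored_stream, high_threshold=0.8, mid_range=(0.4, 0.6),
                 high_per_mid=9, max_skipped=50):
    """Yield about `high_per_mid` high-score examples per mid-score example.

    All names and default values here are illustrative assumptions,
    not Prodigy API.
    """
    high_sent = 0  # high-score examples sent so far
    mid_sent = 0   # mid-score examples sent so far
    skipped = 0    # examples skipped in a row since the last one sent
    for score, eg in scored_stream:
        is_high = score >= high_threshold
        is_mid = mid_range[0] <= score <= mid_range[1]
        # High-score examples are allowed until the current block of
        # `high_per_mid` is full; after that we wait for a mid-score one.
        want_high = is_high and high_sent < high_per_mid * (mid_sent + 1)
        # A mid-score example is allowed once the current block is full.
        want_mid = is_mid and (mid_sent + 1) * high_per_mid <= high_sent
        # Fallback: after too many skips in a row, send the example anyway
        # so the stream never dries up completely.
        if want_high or want_mid or skipped >= max_skipped:
            if is_high:
                high_sent += 1
            elif is_mid:
                mid_sent += 1
            skipped = 0
            yield eg
        else:
            skipped += 1
```

With the setup from the question, this would slot in where prefer_high_scores is now, i.e. add_tokens(nlp, mixed_sorter(model_score(stream, nlp))). The strict "nine high, then one mid" pattern is only one way to hold the ratio; a randomized choice per example would also work if you don't need the order to be predictable.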