Documentation for prefer_low_scores, prefer_high_scores, prefer_uncertain

Prodigy has three built-in functions that drop examples from a stream based on their score: prefer_low_scores(), prefer_high_scores() and prefer_uncertain(). All of them live in the prodigy.components.sorters module. They are not documented anywhere, but they are very useful for anyone doing custom work with Prodigy.
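For context: in a custom recipe, these sorters typically sit between a scoring step and the stream that goes to the annotation interface. Something produces (score, example) tuples and the sorter decides which of those examples actually get sent out. A minimal sketch, where score_stream and my_model are hypothetical placeholders and the tuple format is the one I used in the experiment below:

from prodigy.components.sorters import prefer_uncertain

def score_stream(stream):
    # Attach a score to each example; my_model stands in for whatever
    # produces the score in your recipe.
    for eg in stream:
        yield (my_model(eg['text']), eg)

stream = prefer_uncertain(score_stream(stream))  # yields the examples to annotate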

Because of the lack of documentation and available source code, I ran some experiments to figure out exactly what these functions do.

One important insight is that they drop about 50% of the examples. The distribution of scores of the remaining examples is shown in the histograms below.

import random
import pandas as pd
import matplotlib.pyplot as plt
from prodigy.components.sorters import prefer_low_scores, prefer_high_scores, prefer_uncertain

# Build a stream of 100,000 (score, example) tuples with uniformly distributed scores.
n = 100000
stream = []
for _ in range(n):
    r = random.uniform(0, 1)
    stream.append((r, {'score': r}))

def show_histogram(func, ax):
    # The sorter yields the example dicts it keeps; plot the distribution of their scores.
    scores = [s['score'] for s in func(stream)]
    pd.Series(scores).hist(ax=ax)
    ax.set_title(func.__name__)

fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True, figsize=(15, 5))
show_histogram(prefer_low_scores, ax=ax[0])
show_histogram(prefer_high_scores, ax=ax[1])
show_histogram(prefer_uncertain, ax=ax[2])
plt.show()
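
To check the "drops about 50%" observation numerically rather than just visually, it's enough to count how many examples each sorter yields. This reuses n, stream and the imports from above; the exact fraction may vary slightly between runs:

for sorter in (prefer_low_scores, prefer_high_scores, prefer_uncertain):
    kept = sum(1 for _ in sorter(stream))  # number of examples the sorter emits
    print(sorter.__name__, kept / n)       # fraction kept; roughly 0.5 per the observation above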

I think it would be very useful to have this in the official documentation.

Hi! The histograms are pretty cool :+1: You can find the API docs of the sorters here – sorry if it was hard to find: https://prodi.gy/docs/api-components#sorters

The prefer_high_scores, prefer_low_scores and prefer_uncertain sorters select which examples to send out, and they operate on batches of potentially infinite streams, so they can't just sort everything up front. Instead they use an exponential moving average to decide which examples to send out, based on the distribution of previous scores. This also prevents them from getting stuck as the model's predictions change and it assigns different scores based on the previous annotations (see here for more details).
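
To make that a bit more concrete, here's a rough sketch of the general idea (not Prodigy's actual implementation, and the smoothing value is an arbitrary assumption): keep an exponential moving average of the scores seen so far and emit an example when its score stands out relative to that average, so the threshold adapts as the model's scores drift:

def prefer_high_scores_sketch(stream, smoothing=0.1):
    # Toy EMA-based sorter: works on a potentially infinite stream of
    # (score, example) tuples and never needs to sort the whole stream.
    avg = 0.5                                            # assumed starting point
    for score, example in stream:
        if score >= avg:                                 # high relative to recent scores
            yield example
        avg = (1 - smoothing) * avg + smoothing * score  # update the moving average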