Prodigy has three built-in functions to drop examples based on a score: prefer_low_scores(), prefer_high_scores() and prefer_uncertain(). All are found in the prodigy.components.sorters module. They are not documented anywhere, but they are very useful for anyone doing custom work with Prodigy.
Because there is no documentation and the source code is not available, I ran experiments to figure out what exactly these functions do.
One important insight is that each of them drops about 50% of the examples. The histograms produced by the code below show the distribution of scores of the remaining examples.
import random
import pandas as pd
import matplotlib.pyplot as plt
from prodigy.components.sorters import prefer_low_scores, prefer_high_scores, prefer_uncertain

# Build a stream of (score, example) tuples with uniformly distributed scores
n = 100000
stream = []
for _ in range(n):
    r = random.uniform(0, 1)
    stream.append((r, {'score': r}))

def show_histogram(func, ax):
    # Run the stream through the sorter and plot the scores of the examples it keeps
    scores = [s['score'] for s in func(stream)]
    pd.Series(scores).hist(ax=ax)
    ax.set_title(func.__name__)

# One panel per sorter, with shared axes so the distributions are comparable
fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True, figsize=(15, 5))
show_histogram(prefer_low_scores, ax=ax[0])
show_histogram(prefer_high_scores, ax=ax[1])
show_histogram(prefer_uncertain, ax=ax[2])
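To check the share of dropped examples directly instead of reading it off the histograms, one could count how many examples come out of a sorter. The following is only a minimal sketch: kept_fraction and keep_above_half are hypothetical helpers of my own and not part of Prodigy. The stand-in sorter says nothing about how Prodigy's sorters work internally; it is there so the snippet runs without Prodigy installed.

```python
import random


def kept_fraction(sorter, n=10000, seed=0):
    """Return the share of (score, example) tuples that `sorter` yields.

    `sorter` is any callable that takes an iterable of (score, example)
    tuples and returns an iterable of whatever it decides to keep.
    """
    rng = random.Random(seed)
    # Same stream shape as in the experiment above: uniform scores in [0, 1)
    stream = [(s, {'score': s}) for s in (rng.random() for _ in range(n))]
    kept = sum(1 for _ in sorter(stream))
    return kept / n


def keep_above_half(stream):
    # Hypothetical stand-in sorter for illustration only (NOT Prodigy's logic):
    # keeps every example whose score is above 0.5
    return (example for score, example in stream if score > 0.5)


if __name__ == '__main__':
    print(kept_fraction(keep_above_half))
```

With Prodigy installed, passing one of the real sorters, for example kept_fraction(prefer_uncertain), should give a value close to 0.5 if the observation above holds.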
I think it would be very useful to have this in the official documentation.