Documentation for prefer_low_scores, prefer_high_scores, prefer_uncertain

simon.gurcke · January 8, 2020, 10:57pm

Prodigy has three built-in functions to drop examples based on a score: prefer_low_scores(), prefer_high_scores() and prefer_uncertain(). All are found in the prodigy.components.sorters module. They are not documented anywhere, but they are very useful for anyone doing custom work with Prodigy.

Because there is no documentation and the source code is not available, I ran some experiments to figure out what exactly these functions do.

One important insight is that, in this experiment with uniformly distributed scores, they drop 50% of the examples. The distribution of the scores of the remaining examples is shown in the histograms below.

import random
import pandas as pd
import matplotlib.pyplot as plt
from prodigy.components.sorters import prefer_low_scores, prefer_high_scores, prefer_uncertain

# Build a stream of (score, example) tuples with uniformly distributed scores.
n = 100000
stream = []
for _ in range(n):
    r = random.uniform(0, 1)
    stream.append((r, {'score': r}))

def show_histogram(func, ax):
    # The sorter yields only the examples it keeps; plot their scores.
    scores = [s['score'] for s in func(stream)]
    pd.Series(scores).hist(ax=ax)
    ax.set_title(func.__name__)

# One panel per sorter, with shared axes so the panels are comparable.
fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True, figsize=(15, 5))
show_histogram(prefer_low_scores, ax=ax[0])
show_histogram(prefer_high_scores, ax=ax[1])
show_histogram(prefer_uncertain, ax=ax[2])
plt.show()
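
To put a number on how many examples a sorter keeps, a small helper along the following lines could be used. This is only a sketch: summarize_sorter is a hypothetical name, not part of Prodigy, and it assumes the sorter takes a list of (score, example) tuples and yields the example dicts it keeps, as in the experiment above.

```python
from typing import Callable, Dict, Iterable, List, Tuple

ScoredStream = List[Tuple[float, Dict]]


def summarize_sorter(
    sorter: Callable[[ScoredStream], Iterable[Dict]],
    stream: ScoredStream,
) -> Dict[str, float]:
    """Hypothetical helper: count what a sorter keeps from a finite stream."""
    # Collect the scores of the examples the sorter lets through.
    kept_scores = [eg["score"] for eg in sorter(stream)]
    kept = len(kept_scores)
    return {
        "kept": kept,
        "kept_fraction": kept / len(stream) if stream else 0.0,
        "mean_score": sum(kept_scores) / kept if kept else float("nan"),
    }
```

Calling it as summarize_sorter(prefer_uncertain, stream) with the stream built above should report the kept fraction directly, without reading it off a histogram.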

I think it would be very useful to have this in the official documentation.

ines · January 9, 2020, 11:30am

Hi! The histograms are pretty cool! You can find the API docs of the sorters here (sorry if they were hard to find): https://prodi.gy/docs/api-components#sorters

The prefer_high_scores, prefer_low_scores and prefer_uncertain sorters select which examples to send out and operate on batches of potentially infinite streams. So they use an exponential moving average to decide which examples to send out, based on the distribution of previous scores. This also prevents the sorter from getting stuck as the model's predictions change and it assigns different scores based on the previous annotations (see here for more details).
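
To make the idea more concrete, here is a minimal sketch of how a filter based on an exponential moving average could work on a stream of (score, example) tuples. It is not Prodigy's actual implementation: the names ema_filter, key, alpha and uncertainty, the smoothing factor of 0.1, and the rule "keep an example if its value is at least the current average" are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, Tuple


def ema_filter(
    stream: Iterable[Tuple[float, Dict]],
    key: Callable[[float], float] = lambda score: score,
    alpha: float = 0.1,
) -> Iterator[Dict]:
    """Yield examples whose key(score) is at least the running average.

    Illustrative only: not part of Prodigy's API.
    """
    ema = None
    for score, example in stream:
        value = key(score)
        if ema is None:
            ema = value  # seed the average with the first value seen
        if value >= ema:
            yield example
        # Update the average so the threshold follows the recent scores.
        ema = alpha * value + (1 - alpha) * ema


def uncertainty(score: float) -> float:
    """Map a score in [0, 1] to 1.0 at 0.5 and 0.0 at the extremes."""
    return 1.0 - 2.0 * abs(score - 0.5)


random.seed(0)
stream = [(s, {"score": s}) for s in (random.random() for _ in range(10_000))]

high = list(ema_filter(stream))                             # favours high scores
low = list(ema_filter(stream, key=lambda s: 1.0 - s))       # favours low scores
uncertain = list(ema_filter(stream, key=uncertainty))       # favours scores near 0.5

for name, kept in [("high", high), ("low", low), ("uncertain", uncertain)]:
    print(name, len(kept) / len(stream))
```

Because the threshold is a moving average rather than a fixed cutoff, it adapts when the score distribution shifts, and the filter works one example at a time, so it never needs to see the whole stream.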
