active learning covering all candidates

Context: we're trying to customize the active learning process to select the most uncertain examples from all candidates for annotation.
Question: where do we customize this scope? In the sample code below, does the 'stream' argument of __call__() refer to all candidates or just some subset? Thanks

import random

class DummyModel:
    def __init__(self, labels):
        # The model can keep arbitrary state – let's use a simple random float
        # to represent the current weights
        self.weights = random.random()
        self.labels = labels

    def __call__(self, stream):
        for eg in stream:
            # Score the example with respect to the current weights
            eg['label'] = random.choice(self.labels)
            score = (random.random() + self.weights) / 2
            yield (score, eg)

    def update(self, answers):
        # Update the model weights with the new answers
        self.weights = random.random()
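For context, a toy run of the model above outside of Prodigy (just plain Python, example data made up) shows how whatever stream we pass in is scored lazily:

model = DummyModel(labels=["POSITIVE", "NEGATIVE"])
toy_stream = ({"text": f"document {i}"} for i in range(3))
for score, eg in model(toy_stream):
    print(round(score, 2), eg["text"], eg["label"])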

We have some documentation for custom active learning models here, and indeed Prodigy expects the __call__(self, stream) method to make predictions and yield (score, example) pairs. If you scroll down in the documented example, you'll also see this pseudocode:

# pseudocode! 
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe("custom-textcat")
def custom_textcat_recipe(dataset, source):
    model = Model()
    stream = JSONL(source)              # load the data
    stream = model(stream)              # call custom predict function
    stream = prefer_uncertain(stream)   # sort to prefer uncertain scores

    return {
        "dataset": dataset,          # dataset to save annotations to
        "stream": stream,            # the incoming stream of examples
        "update": model.update,      # the update callback
        "view_id": "classification"  # annotation interface to use
    }

In this example you can see that the update callback refers to the update method of the model. When the model is updated, the next batch will receive new scores because of this line:

stream = model(stream)

Then, because this stream is filtered via prefer_uncertain on the next line, it should yield more examples with scores closer to 0.5.
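As a hedged sketch (this is not Prodigy's implementation, just an illustration), the update method could actually use the incoming answers, which carry the annotator's "accept"/"reject" decisions, instead of drawing a new random weight:

    def update(self, answers):
        # Hypothetical update rule: nudge the weights towards the fraction of
        # accepted answers, just to show that `answers` feeds back into the model
        accepted = sum(1 for eg in answers if eg.get("answer") == "accept")
        rejected = sum(1 for eg in answers if eg.get("answer") == "reject")
        total = max(1, accepted + rejected)
        self.weights = (self.weights + accepted / total) / 2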

If you're interested in playing around with these "preferences", you can also visualise them in a Jupyter notebook if you like.

import random
import matplotlib.pyplot as plt
from prodigy.components.sorters import prefer_uncertain, prefer_high_scores, prefer_low_scores

n = 10000
# Fake (score, example) pairs: the "example" is just the score itself, so the
# histograms show which scores each sorter lets through
data_in = [(i / n, i / n) for i in range(n)]
random.shuffle(data_in)


def grid_plots(funcs=(prefer_uncertain, prefer_high_scores, prefer_low_scores)):
    plt.figure(figsize=(12, 4))
    for i, func in enumerate(funcs):
        # Each sorter consumes (score, example) pairs and yields examples
        received = list(func(data_in))
        plt.subplot(1, len(funcs), i + 1)
        plt.hist(received, bins=30)
        plt.title(func.__name__)
    plt.show()

grid_plots()

Does this help? It's a bit unclear what you mean by "scope", could you elaborate?

Thank you very much for the prompt reply! Sorry for the unclear question, let me rephrase what I mean by 'scope'.
Let's assume there are 10k unlabeled documents, and we will need to do active learning:

  1. annotators label examples in the UI, e.g. 100 documents
  2. the model gets retrained/fine-tuned with the new labeled data
  3. the model re-scores the unlabeled documents and selects uncertain samples for the next round of annotation

My question is: in step 3, when we call "stream = model(stream)", will the model re-score all (10k - 100 labeled) remaining documents, or only a batch/subset of the 10k, with the batch size defined somewhere else such as 1000?
I guess the answer is the former, (10k - 100 labeled). Is that right?
Thanks.

Prodigy doesn't re-score everything, because we don't know upfront how many unlabelled examples there are. It could be 100, it could be 100 million! That is why Prodigy samples from the stream. Everything runs on a stream of batches, and it uses a clever sampling trick to pluck interesting candidates from each batch.
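To make the "stream of batches" idea concrete, here's a minimal, Prodigy-independent sketch (the names are illustrative) of a lazy stream that is consumed batch by batch, so nothing needs to be scored up front:

import itertools

def example_stream():
    i = 0
    while True:  # could just as well be 100 examples or 100 million
        yield {"text": f"document {i}"}
        i += 1

def batched(stream, batch_size=10):
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch

stream = example_stream()
first_batch = next(batched(stream, batch_size=10))
print(len(first_batch))  # only 10 examples have been produced so far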

If you have a look at the simulation results above, you may notice that there's a preference but not a hard cutoff. Prodigy does a variant of reservoir sampling such that the odds of selecting the interesting candidates are good. This allows us to select candidates from a near-infinite stream, while also allowing us to update the model in the loop. The updated model has an updated belief about each batch that comes in.
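As a hedged illustration of "a preference but not a hard cutoff" (not Prodigy's actual sorter, just a sketch), a streaming filter can emit an example with a probability that grows as its score approaches 0.5:

import random

def prefer_uncertain_sketch(scored_stream):
    for score, eg in scored_stream:
        interest = 1.0 - abs(score - 0.5) * 2  # 1.0 at score 0.5, 0.0 at 0 or 1
        if random.random() < interest:
            yield eg

Every incoming example still has some chance of being kept or skipped, which is what lets this kind of filter work on a stream of unknown length.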

Does this help?

Very helpful, thanks. I will implement my custom active learning process.
Another question: where can I define the batch size and the number of uncertain examples? I would like to adjust them in order to cover more samples during random sampling.

The active learning process is: from all unlabeled examples (e.g. 10 million), it selects X (e.g. 100) examples for "model(stream)", then "prefer_uncertain(stream)" will yield Y (e.g. a few) examples for UI annotation. My understanding is:
X can be modified in the Prodigy config file
Y can be modified in prefer_uncertain()
Is that right? Thanks.