Active learning model default sample size?

What's the default active learning sample size? I mean, how many cases does the model use to select the next text?
I have 85K texts, and only 6 of them are positive cases. Those 6 cases were never surfaced for labelling when I used a decent model (0.85 AUC) for active learning. I also updated the sort order to prefer_high_scores. I wonder whether it's because the active learning process makes its decisions from a small sample of the data set. What is the default sample size? Is it possible to make the sample size bigger? Thanks.

It's always processing one batch at a time, with the size defined by the "batch_size" setting. That's also the size of the batches used to update the model in the loop. The default batch size is 10, so this will look at and update with 10 examples at a time.
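To make the mechanics concrete, here's a rough sketch of that loop in plain Python (this is not Prodigy's actual internals, and the score, annotate and update callables are placeholders): pull batch_size examples from the stream, collect the answers, and update the model with that same batch before moving on.

from itertools import islice

def toy_active_learning_loop(stream, score, annotate, update, batch_size=10):
    stream = iter(stream)
    while True:
        # Take the next batch_size examples off the stream
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        # Collect annotations for this batch ...
        answers = [annotate(eg, score(eg)) for eg in batch]
        # ... and update the model with the same batch before continuing
        update(answers)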

The prefer_ sorters have a built-in mechanism that uses an exponential moving average to determine what constitutes a high/low/uncertain score. That's done so you never end up stuck in a loop if the scores vary, or if the model gets stuck in a suboptimal state: it will always try to send something out and adjust the threshold if no examples qualified in the previous batches.
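Just to illustrate the idea (a simplified sketch, not the actual implementation of the prefer_ sorters), an exponential-moving-average threshold could look like this:

def prefer_high_scores_sketch(scored_stream, alpha=0.1):
    # Keep an exponential moving average of the scores seen so far and
    # send out examples that score above it. Because the average keeps
    # adapting, the sorter never gets permanently stuck behind a fixed
    # threshold.
    ema = 0.5
    for score, example in scored_stream:
        if score >= ema:
            yield example
        ema = alpha * score + (1 - alpha) * ema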

Ultimately, the sorters are just generator functions that take (score, example) tuples and yield examples. So you can also experiment with implementing your own, not using an exponential moving average, and so on. For example:

def custom_sorter(scored_examples):
    for score, example in scored_examples:
        # your own logic here to decide whether to send out an
        # example for annotation
        if score > 0.5:
            yield example
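
If you want to sanity-check a sorter like this before plugging it into a recipe, you can just call it on some dummy (score, example) tuples:

scored = [(0.2, {"text": "a"}), (0.9, {"text": "b"}), (0.7, {"text": "c"})]
print(list(custom_sorter(scored)))  # -> [{'text': 'b'}, {'text': 'c'}]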

I updated the configuration and created my own sorter. It helped surface more relevant texts, even with so few positive examples. Thanks!

Another question: when I click the 'save' button, does it only save the annotations? It would be nice if the model could be updated and the next best candidates refreshed after I click the save button. Is there any way I can change the behavior of the 'Save' button?

The answers are always sent back in batches, so Prodigy waits until it has a full batch of size batch_size and then sends that back. It also keeps a batch on the client so you can undo if needed (before examples are sent back to the server).

When answers are received on the server, the update callback can then use them to update the model. And once the model is updated, it's used to process the next batch of the stream (whenever that is requested) and the updated predictions are reflected in the suggestions that are sent out.
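For reference, a bare-bones custom recipe wiring this up might look roughly like the sketch below. The recipe name, the JSONL source and the model update are placeholders, and the exact imports can differ between Prodigy versions; the important parts are the update callback and, if you want larger batches, the batch_size override in the returned config.

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("textcat.custom-teach")
def custom_teach(dataset, source):
    stream = JSONL(source)  # stream of {"text": ...} examples

    def update(answers):
        # Called with each batch of answers the server receives.
        # Update your model here; the next batch of the stream that's
        # requested will then be scored by the updated model.
        ...

    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,
        "view_id": "classification",
        "config": {"batch_size": 10},  # size used for sending out and updating
    }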

You could set "instant_submit": true to immediately send back each example as it's annotated, but it probably doesn't make much sense to update your model constantly with single examples. You typically want to wait until you have a reasonable batch of answers.