Batch size

Is there any way to control the batch size used by Prodigy? I'm experimenting with a custom update callback, and I'd like to set the batch size to 1 so I can test.

Also, I was under the impression that active learning usually works one sample at a time – i.e. with uncertainty sampling, you pick the single example with the highest uncertainty. If you have many similar examples, won't selecting a whole batch based on uncertainty likely present many examples that are much the same?

I've seen a few papers on batch active learning where some form of clustering is used to select one sample from the midpoint of each cluster, but that doesn't seem to be what the Prodigy recipes are doing?

Hi! The batch_size setting in your prodigy.json lets you control the size of the batches of examples that are sent to the server and back. This also affects how often the app asks for a new batch of questions when the queue is running low. Keep in mind that there'll always be at least one batch "in transit": the app keeps a batch of batch_size or history_size (whichever is lower) in the history so you can undo easily. Once an annotated batch is complete, it's sent back to the server.
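For reference, a minimal prodigy.json overriding the setting might look like this (the exact set of other available settings depends on your Prodigy version):

```json
{
  "batch_size": 5
}
```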

The built-in sorters like prefer_uncertain operate on the whole stream before it's batched up and expect (score, example) tuples. You can read more about them here:

There's actually very little magic going on here – based on the score, it decides whether to yield the example or not. In the built-in functions, we also use an exponential moving average so we can process a potentially infinite generator, but avoid getting stuck in a suboptimal state, e.g. if the score threshold shifts slightly as the model is updated with more annotations.
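To make the idea concrete, here's a toy sketch of that kind of filter – not Prodigy's actual implementation, just the mechanism described above: track an exponential moving average of the uncertainty and only yield examples that are at least as uncertain as the recent average, so the threshold keeps adapting as the stream (and the model behind the scores) changes.

```python
def ema_uncertainty_filter(scored_stream, smoothing=0.9):
    """Toy uncertainty filter over (score, example) tuples.

    A hedged sketch, not Prodigy's internals: yield an example if its
    uncertainty is at least the exponential moving average of the
    uncertainties seen so far, then update the average.
    """
    avg = None
    for score, example in scored_stream:
        # Uncertainty is highest (1.0) at score 0.5, lowest (0.0) at 0 or 1.
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if avg is None:
            avg = uncertainty
        if uncertainty >= avg:
            yield example
        # Move the threshold towards the recent uncertainty level.
        avg = smoothing * avg + (1 - smoothing) * uncertainty
```

Because the average is updated on every example, a confidently-scored run of examples lowers the bar again, so the filter can't get permanently stuck.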

That said, you should be able to very easily implement your own strategy in your custom recipe, right after loading and scoring your stream of raw examples. You can batch it up however you like, or even load all examples into memory and make multiple passes over them etc. At the end of it, the recipe should return a generator of dictionaries as its "stream" – how those examples are selected is up to you :slightly_smiling_face:
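As a sketch of the load-everything-into-memory variant: the helper below (hypothetical – `score_fn` stands in for whatever scoring your model provides, assumed to return a probability in [0, 1]) sorts all examples by uncertainty and yields the most uncertain first. A recipe would return this generator as its stream.

```python
def most_uncertain_first(examples, score_fn):
    """Hypothetical selection strategy for a custom recipe's stream.

    Loads all examples into memory, scores them with score_fn and
    yields them ordered by distance from 0.5 (closest first).
    """
    scored = [(abs(score_fn(eg) - 0.5), eg) for eg in examples]
    scored.sort(key=lambda pair: pair[0])  # most uncertain first
    for _, eg in scored:
        yield eg
```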

Seems like batch_size must be > 1 in prodigy.json? Setting it to 1 results in my update callback getting called once, and then the UI says there are no more records; setting it to 2 or more works fine.

I need to think a bit about a custom strategy. What I'm confused about is that the stream is a Python generator, isn't it? Does prefer_uncertain tee the generator? Otherwise, how does it sort unlabelled examples without consuming the generator? The state of the pipeline changes as the model is updated, so wouldn't the unlabelled examples need to be re-sorted on each update?

model = Model()

stream = CSV(source)
stream = add_tokens(stream)
stream = prefer_uncertain(model(stream))

The "sorter" terminology here is maybe a bit misleading, sorry about that – I guess it's probably more like a "filter". Based on the scores, it decides whether an example should be sent out for annotation or not. Depending on the strategy you choose, you could make your stream cycle infinitely, filter a bit more aggressively and keep sending out unseen examples so you can make multiple passes over the stream as the model is updated.
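A hedged sketch of that cycling idea (the names `score_fn` and `threshold` are my own, and the pass count is capped here so the example terminates): on each pass, re-score what's left and only send out examples the model is still uncertain about, keeping the rest for a later pass once the model has been updated.

```python
def cycling_stream(examples, score_fn, threshold=0.3, max_passes=10):
    """Sketch of a stream that makes multiple passes over the data.

    Uncertain examples are yielded for annotation; confident ones are
    held back and re-scored on the next pass, since the model (behind
    score_fn) may have changed in the meantime.
    """
    remaining = list(examples)
    for _ in range(max_passes):
        if not remaining:
            break
        next_round = []
        for eg in remaining:
            uncertainty = 1.0 - abs(score_fn(eg) - 0.5) * 2
            if uncertainty >= threshold:
                yield eg  # uncertain enough: ask the annotator
            else:
                next_round.append(eg)  # confident: revisit later
        remaining = next_round
```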

Thanks Ines, I understand now. I guess this strategy makes sense if you have many more examples than you need, but a fixed annotation budget.

What I'm trying to do is slightly different. I need the annotator to check every example, but I want the model to label it where possible so the annotator can click through as quickly as possible. But I see how I could implement this by maintaining a set of labelled and unlabelled examples and looping until unlabelled is empty.
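That loop could be sketched like this – a minimal, hypothetical outline where `predict_fn` returns the model's (label, score) suggestion and `ask_fn` stands in for the annotation step, returning True if the annotator confirms the example on this pass:

```python
def annotate_everything(examples, predict_fn, ask_fn):
    """Sketch of the strategy above: every example gets checked, with
    the model's suggestion attached so confirming is a single click.

    Examples not resolved on a pass go back into the unlabelled pool
    and are re-predicted on the next pass (the model may have been
    updated in between).
    """
    labelled, unlabelled = [], list(examples)
    while unlabelled:
        still_unlabelled = []
        for eg in unlabelled:
            label, score = predict_fn(eg)
            eg = dict(eg, label=label)  # attach the model's suggestion
            if ask_fn(eg):
                labelled.append(eg)
            else:
                still_unlabelled.append(eg)
        unlabelled = still_unlabelled
    return labelled
```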

It would be nice, though, if the documentation contained links to the paper / algorithm implementation. I assume `prefer_uncertain` does some sort of random sampling – i.e. it either yields or skips each example, with the probability based on how close the score is to 0.5?

Yes, exactly – and it uses an exponential moving average to prevent you from getting stuck as the model changes (e.g. if the scores end up lower/higher over time). There's very little magic going on otherwise.