Prodigy Version 1.5.1 vs 1.4.2

Hello,

It appears there is a difference between versions 1.4.2 and 1.5.1 with regard to the annotations displayed.

In version 1.5.1, it seems that a new annotation task is shown after a web browser refresh, so some texts may be skipped. This way we may not be able to use the whole dataset.

thank you in advance
kind regards
claudio nespoli

Hi! I’m surprised you saw different behaviour in v1.4.2, because the way streams work hasn’t really changed. On load, the web app makes a request to the /get_questions endpoint, which then requests the next batch from the stream. By default, the stream doesn’t know what you’ve already annotated and what’s still in progress, so it will always return the next batch when you reload the browser.
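To illustrate why a refresh moves you forward, here is a toy model in plain Python. It is only a sketch: the get_questions function below is a hypothetical stand-in for the endpoint, not Prodigy's actual implementation, and the batch size of 2 is arbitrary.

```python
from itertools import islice


def make_stream(texts):
    """A stream is a generator that yields one task dict per text."""
    for text in texts:
        yield {"text": text}


def get_questions(stream, batch_size=2):
    """Hypothetical stand-in for the endpoint: take the next batch from the stream."""
    return list(islice(stream, batch_size))


stream = make_stream(["a", "b", "c", "d", "e"])

first_load = get_questions(stream)     # the browser loads the app
after_refresh = get_questions(stream)  # the browser is refreshed

# The generator has already moved past the first batch, so the refresh
# receives the following one, whether or not the first was answered.
print([eg["text"] for eg in first_load])
print([eg["text"] for eg in after_refresh])
```

Because a generator cannot be rewound, any unanswered examples from the first batch are not sent out again unless the stream itself is written to do so.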

If it’s important to you that you annotate a full dataset in order, you could manage this in a custom loader and use the _task_hash to check whether an example has already been annotated (i.e. is already in the database) or whether to send it out again. (Note that this will only work if you’re using a “static” recipe without any active learning or a model in the loop. The active learning recipes use the model’s predictions to decide whether or not to show an example for annotation, so you’ll only see a selection of examples based on the current model state.)
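A minimal sketch of that filtering step, in plain Python. It assumes every incoming example already carries a _task_hash and that you have collected the hashes of the annotated examples into a set beforehand (in a real recipe they would come from the Prodigy database; here they are passed in directly). The function name skip_annotated and the sample data are hypothetical.

```python
def skip_annotated(stream, seen_hashes):
    """Yield only the examples whose _task_hash is not in seen_hashes."""
    for eg in stream:
        if eg["_task_hash"] not in seen_hashes:
            yield eg


# Hypothetical examples; real hashes would be assigned by Prodigy.
examples = [
    {"text": "first text", "_task_hash": 101},
    {"text": "second text", "_task_hash": 102},
    {"text": "third text", "_task_hash": 103},
]

# Hashes of examples that are assumed to be in the database already.
already_annotated = {101, 103}

remaining = list(skip_annotated(examples, already_annotated))
print([eg["text"] for eg in remaining])
```

Since the filter is itself a generator, it can wrap whatever loader produces the stream, and the original order of the unannotated examples is preserved.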

Thank you,

I will try that for a “static” recipe (we could call it passive learning?), and with regard to active learning, we will try to reduce the number of refreshes.

I will also check (it may be useful) whether, during active learning, once all batches of the dataset have been processed, it starts again from the beginning or stops with “no tasks available”.

thank you again
kind regards
Claudio Nespoli

By default, it doesn't do that. The active learning recipes will try to select the examples the model is most uncertain about, so it's usually not that helpful to start again at the beginning and get predictions for the same examples again.

But a Prodigy stream is really just a Python generator, so you can always write your own logic. For example, you could make it fetch a new file if no texts are left. Or you could have it check how well the model is learning and only load new texts from a different source if it hasn't learned enough yet. (When using an active learning recipe, the progress you see in the app, which is also available as the progress attribute of the controller, is an estimate of when the loss will hit 0, i.e. when the model has learned "everything it could learn" from the data.)
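Here is a rough sketch of both ideas in plain Python. Everything in it is an assumption for illustration: chained_stream, the loaders and needs_more are hypothetical names, the sources are in-memory lists rather than files, and how you decide whether the model "has learned enough" is left to a callback you would write yourself.

```python
def chained_stream(loaders, needs_more=lambda: True):
    """Yield examples from each source in turn.

    loaders:    list of zero-argument callables, each returning an iterable
                of examples (e.g. a function that opens and reads one file).
    needs_more: callable checked after a source runs out; returning False
                stops the stream instead of moving on to the next source.
    """
    for load in loaders:
        yield from load()
        if not needs_more():
            break


# Hypothetical in-memory sources standing in for two files.
def load_first():
    return [{"text": "a"}, {"text": "b"}]


def load_second():
    return [{"text": "c"}]


# Keep going through every source.
all_texts = [eg["text"] for eg in chained_stream([load_first, load_second])]

# Stop after the first source, e.g. because the model has learned enough.
short_texts = [
    eg["text"]
    for eg in chained_stream([load_first, load_second], needs_more=lambda: False)
]
print(all_texts, short_texts)
```

The loaders are only called when the stream reaches them, so a later source is never opened if the check says no more data is needed.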