Synchronous Batch Mode

Is it possible, via the prodigy.json file or otherwise, to make Prodigy work synchronously? What I want to do is select and send a batch for annotation, and wait for it to return. I've used the model predict and update pattern, but it seems to be asynchronous – the UI seems to read well ahead, so the results of an update to the model don't apply to the current batch being annotated. I'm doing research, so I'm looking for full control of the query / teach interface.

Hi! One of the main concerns for the web app is to make sure the queue is never running low, so it will keep asking for new batches in the background as the current batch is being annotated. The motivation here is that one individual example typically doesn't matter very much, and a single update to the model won't move the needle enough for it to be worth blocking on it. Depending on the model, blocking could easily take 10+ seconds or even a minute, so that's not really viable to do synchronously.

Setting "instant_submit": true will send an answer back as it's annotated, so you could enable that, collect the individual answers as they come back and then batch them up to periodically update your model. It still wouldn't block, but with a batch size of 1, you'd only ever have it ask for the next example in the background. Another option that's less elegant but could work: you could send dummy examples in between that keep the app busy, and in the worst case, you'd see the dummy example in the UI (that could say something like "please wait") and then skip it to move on.
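The "collect the individual answers and batch them up" idea can be sketched in plain Python. This is just an illustration of the buffering logic, not a Prodigy API – `AnswerBuffer` and `update_model` are hypothetical names, and you'd wire `receive` in as your recipe's `update` callback:

```python
class AnswerBuffer:
    """Collects answers that come back one at a time (e.g. with
    instant_submit enabled) and flushes them to a model update once
    a full batch has accumulated."""

    def __init__(self, update_model, batch_size=10):
        self.update_model = update_model  # your recipe's update logic
        self.batch_size = batch_size
        self.pending = []

    def receive(self, answer):
        # Call this from the recipe's update callback for each answer.
        self.pending.append(answer)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Push whatever has accumulated, then reset the buffer.
        if self.pending:
            self.update_model(list(self.pending))
            self.pending.clear()


# Usage: here update_model just records the batches it was given.
updates = []
buffer = AnswerBuffer(update_model=updates.append, batch_size=3)
for i in range(7):
    buffer.receive({"text": f"example {i}", "answer": "accept"})
buffer.flush()  # flush the remainder when annotation ends
# updates now holds batches of sizes 3, 3 and 1
```

The update still happens in the background rather than blocking the UI, but it gives you control over how many answers accumulate before the model is touched.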

Thanks @ines, I've tried this, but every time I set batch_size=1 I get one example to annotate. When I hit accept, update is called and the UI says "No Tasks Available":

     "batch_size": 1,
     "instant_submit": true

If I set batch_size to 2, my update callback gets called twice (two lists, each containing 1 example), which makes sense as that's controlled by instant_submit – but then I still get "No Tasks Available".

I tried various combinations of batch_size and instant_submit and found that there seems to be a minimum value for batch_size (around 5) below which the UI displays "No Tasks Available"; in every case I had to set instant_submit to false.

I think the main decisive factor here might be how long it takes for the stream to produce the next example – if there's a model in the loop that needs to process the text, that may take a while, and a lower batch size obviously makes it harder to take advantage of batch processing to speed things up.

If you do hit the "no tasks available", does refreshing the browser help?

Thanks @ines, I don't actually have a model in the loop. I'm streaming JSONL with a pattern matcher and add_tokens.
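For context, the stream is roughly this shape – a heavily simplified, self-contained stand-in that uses a naive substring matcher instead of Prodigy's real pattern matcher and skips add_tokens entirely, just to show the tasks the stream yields:

```python
import json

def jsonl_stream(lines, patterns):
    """Read JSONL records and attach character-offset spans for any
    pattern phrases found in the text. `patterns` is a list of
    (label, phrase) pairs. Records with no match are skipped."""
    for line in lines:
        eg = json.loads(line)
        spans = []
        for label, phrase in patterns:
            start = eg["text"].lower().find(phrase.lower())
            if start != -1:
                spans.append({"start": start, "end": start + len(phrase), "label": label})
        if spans:
            eg["spans"] = spans
            yield eg

# Usage: only the first record matches the "apple" pattern.
lines = ['{"text": "Apple buys a startup"}', '{"text": "nothing here"}']
tasks = list(jsonl_stream(lines, [("ORG", "apple")]))
# tasks[0]["spans"] == [{"start": 0, "end": 5, "label": "ORG"}]
```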

Yes, refreshing the browser each time it says "No Tasks Available" does in fact help. I have to do it after every example, though.

I did notice that the validate_answer callback seems to work regardless of the settings – it gets called on every example. Of course, the controller will have read the next n examples by then, so it doesn't quite do what I need. My main issue at the moment is that my source data is sorted pretty much by entity mention, so I probably need to shuffle it for my matcher to be useful. As it stands, one batch will often contain a run of the same entity: if I add it via the update callback it's often too late, because that specific entity won't occur again, but the annotator has had to tag it 10 times in a row.

PS: The docstring for add_tokens is missing documentation for the last two arguments (overwrite and use_chars).