Is there an option to autosave the data in the dataset after every nth annotation?
Yes, annotations are saved automatically once a full batch (set via the batch_size setting, which should default to 10) is collected. Prodigy allows you to undo/change annotations in the “history” in the sidebar. After that, they’re sent back to the server in batches. The only time you have to save manually is when you want to stop annotating in the browser, to ensure that the entire history and outbox are cleared and saved.
It seems even with batch_size=1 the auto-save is 2 examples behind, so if you just close the window at the end of a labeling session you lose the last 2 labels. For example I had labeled 14 examples in the UI, but only had 12 in the DB until I hit the “save” button manually. Is there a way to fix this?
Any ideas how this could be fixed? Debugging the bundle.js sounds a bit painful since it’s minified… Without autosave working correctly on every example, it’s hard to make other things work like surviving page refreshes/closing the browser.
I’ve posted an idea in the thread referenced above:
The main thing is that this should be decoupled from the batch size and become a separate setting: either instant_submit or even a maximum number of answers to keep (if set to 0, this would mean every answer is saved immediately). You could still combine it with a batch size of 1, which would have the effect you’re looking for, but users could also send out larger batches while still having answers sent back instantly (e.g. to do single updates to the model or trigger something else on every update).
So this is definitely something I can implement for the next release.
Awesome! Looking forward to it.
@taavi What do you think of the name answer_batch_size for the setting?
I first wanted to call it auto_save_threshold, but that’s possibly too cryptic. answer_batch_size is analogous to the regular batch size setting and ultimately, that’s really what it is: it’s the number of examples sent back to the server at once. Prodigy will wait until one batch is full. So setting "answer_batch_size": 1 means that we only need to wait for one answer == auto-save.
Sounds good to me. I’m not too picky about the setting name.
Hi @ines, will this enable the use case from the other thread linked above? I’m hoping that when a user submits an answer, it goes immediately to the db, so that the data iterator can query the DB in its __iter__() method, and choose the very next item to yield based on the user’s answer. I’m only asking because answer_batch_size=1 sounds like there might still be a batch that would cause a delay of 1, that is, a “wait for one answer”. But I might just be thinking about it wrong. Thank you so much for your help on this!
An answer batch size of 1 would mean that as soon as there’s 1 answer in the app, it gets sent back.
For your scenario, there’s still a problem that’s more difficult to solve in general: you cannot know how long the requests will take. While the answer is sent back, a new batch will already be requested (also since the queue is empty) and it will most likely arrive before the answer is received, stored in the database and retrieved from there in your stream.
Prodigy tries its best to always make sure that there are enough questions. After all, annotation can be pretty fast and the annotator should never see a “Loading…” message in between. You could probably work around that by just waiting to yield out the next task until you have the previous one, but once you’ve implemented that, you might find that it creates a pretty frustrating annotation experience.
This sounds like it will work perfectly for our use case, where we are not concerned about a few seconds of “Loading…” while waiting to yield. Any sense for when a version of Prodigy with the new answer_batch_size parameter will be released?
It’s the last ticket on my list for Prodigy 1.7!
And yeah, in that case, I’m pretty sure the idea would work. Ultimately, you’d just have to orchestrate this between two closures: the stream generator function and the update callback.
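To make the "two closures" idea concrete, here is a minimal sketch in plain Python. It assumes tasks are dicts with a hypothetical "id" key and that the update callback receives a list of answered tasks; the helper name make_stream_and_update, the polling interval and the timeout are illustrative assumptions, not part of Prodigy.

```python
import time


def make_stream_and_update(examples, poll_interval=0.01, timeout=5.0):
    """Return a (stream, update) pair that share state through closures.

    Hypothetical helper: the stream holds back the next task until the
    previous one has been answered (or until `timeout` seconds pass).
    """
    answered = {}  # task id -> answer, filled in by update()

    def update(answers):
        # Record every answer that was sent back from the app.
        for eg in answers:
            answered[eg["id"]] = eg.get("answer")

    def stream():
        previous_id = None
        for eg in examples:
            if previous_id is not None:
                waited = 0.0
                # Wait for the previous answer; give up after the timeout
                # so the annotator is never blocked forever.
                while previous_id not in answered and waited < timeout:
                    time.sleep(poll_interval)
                    waited += poll_interval
            yield eg
            previous_id = eg["id"]

    return stream(), update
```

In a custom recipe, the two returned objects would be passed on as the stream and the update callback. The wait only helps if the callback runs while the stream is blocked, and whether that happens depends on how the server handles requests, so treat this as a starting point rather than a guaranteed fix.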
Okay, so I ended up going with "instant_submit" after all (sorry for the back and forth), because the answer batch size would just lead to many confusing scenarios that were internally consistent and “logical”, but not always what you’d expect. But instant submission is now available in v1.7.0 and should work as expected. Thanks again for your patience on this!
I finally got a chance to try this out and it’s nice, thanks for implementing it!
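For anyone finding this later, a rough sketch of how the setting might be switched on from a custom recipe. It assumes the recipe returns a dictionary of components with a "config" entry; the function name, dataset name and view ID are placeholders, and in a real recipe the function would also be registered as a Prodigy recipe.

```python
def custom_recipe_components(dataset, stream):
    """Build the dictionary of components a custom recipe would return.

    Hypothetical helper for illustration only.
    """
    return {
        "dataset": dataset,           # where answers are stored
        "stream": stream,             # iterable of tasks
        "view_id": "classification",  # placeholder annotation interface
        "config": {
            "instant_submit": True,   # send each answer back immediately
            "batch_size": 1,          # optional: also fetch one task at a time
        },
    }


components = custom_recipe_components("my_dataset", iter([]))
```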
There’s only 1 more small issue: the UI calls “get_questions” before it calls “give_answers”, so in case we implement any kind of refresh or browser restart resilience, the last example is shown twice, because it’s not yet saved by the time we’re asking for more examples.
By default whenever you close the browser or refresh, examples that were previously shown are lost forever, since the stream will just pull the next one. In a scenario where we’re trying to get a whole dataset labeled without missing values, that doesn’t work super great - sometimes people’s batteries die or whatever, they need to be able to return to the task.
The very first request always needs to be get_questions, because there needs to be a queue of questions. After that, the app tries to make sure there are always enough questions in the queue so that the annotator never hits a block or needs to wait long for the next batch. Also see my comments on this from earlier in the thread:
Yes, in that case (at least if you want to handle everything in a single session), you probably want your stream to be an infinite loop that keeps repeating the stream until all examples are confirmed to be in the dataset. See here for an example.
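A minimal sketch of such a repeating stream, in plain Python. It assumes every task already carries a _task_hash and that get_saved_hashes is a hypothetical callable returning the hashes currently stored in the dataset (in a real recipe it would query the database).

```python
def repeat_until_saved(examples, get_saved_hashes):
    """Keep cycling over `examples` until every task hash is in the dataset.

    `get_saved_hashes` is a hypothetical callable that returns the
    collection of task hashes already saved.
    """
    while True:
        saved = set(get_saved_hashes())
        pending = [eg for eg in examples if eg["_task_hash"] not in saved]
        if not pending:
            break  # everything is confirmed to be in the dataset
        for eg in pending:
            # Re-check right before yielding, in case the answer arrived
            # while earlier tasks of this pass were being annotated.
            if eg["_task_hash"] not in get_saved_hashes():
                yield eg
```

Note that this only checks what is already saved, so a task whose answer is still on its way back can be sent out a second time.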
Yeah I have the endless loop logic implemented, but you see how it’s impossible for me to not show the last example twice, if the “save” doesn’t go through before the “fetch”, right? Of course on a fresh load, the app needs to fetch example(s) first, but when saving, the only reason to read before writing is a minor speed boost…
Yes, I definitely know what you mean. I’ll have a look and test if we can delay requesting a new batch in the “instant submit” mode until the answers are sent. I think the tricky part here is that the request posting the answers doesn’t only have to be made, the answer also has to be present in the database already before the next batch is requested.
Is there any update on this? I am also facing this issue where, even with an infinite loop to check if data exists, 2 or more examples get repeated because of the pre-fetch before saving.
Also, when I try to save manually by setting "db": False in my custom recipe, I get an error stating "AttributeError: 'bool' object has no attribute 'db'". Any pointers?
Are you setting "instant_submit": True? This will submit an answer as soon as it's created in the UI.
If you want to disable saving to the database, try setting "dataset": False. (Depending on what you're doing, it might still make sense to keep saving to a local SQLite DB as a backup, just in case.)
Yes, I set "instant_submit": True. But the problem is that the next batch is fetched before the current answer is saved. So there is a chance that the same image I just annotated appears in the next batch, and checking the _task_hash for duplicates does not work there.
I will try that and see.
Thank you for your prompt reply.
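One possible workaround for the pre-fetch race described above, sketched in plain Python: besides checking which hashes are saved, also remember when each task was last sent out and skip it for a grace period. The function name, the sent_at dictionary and the grace_seconds value are all assumptions for illustration; none of this is built into Prodigy.

```python
def eligible_examples(examples, saved_hashes, sent_at, now, grace_seconds=30.0):
    """Return the tasks that are neither saved nor recently sent out.

    `sent_at` maps task hash -> time the task was last sent; it is
    updated in place so later calls can see what is still "in flight".
    """
    eligible = []
    for eg in examples:
        task_hash = eg["_task_hash"]
        if task_hash in saved_hashes:
            continue  # already answered and stored
        last_sent = sent_at.get(task_hash)
        if last_sent is not None and now - last_sent < grace_seconds:
            continue  # probably still being annotated or saved
        sent_at[task_hash] = now
        eligible.append(eg)
    return eligible
```

A stream could call this on every pass with the current time and the hashes from the dataset. Tasks lost to a refresh become eligible again once the grace period has passed, at the cost of a short delay before they reappear.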