sentences dropping out

My team runs Prodigy dockerized as an app service on Azure, with multi-user sessions backed by an Azure Postgres database. When we look at the results, we're seeing sentences dropping out and not getting saved to the database, even though they were provided as input in the JSONL file. We're classifying sentences manually with the textcat.manual recipe, configured so that each sentence is classified only once in total (no more than one person sees and categorizes each sentence). We suspect it has something to do with pressing the save button. We all try to make sure we press save at the end of our sessions, and some of us also press it throughout a session just in case. Usually the missing entries are single sentences, but occasionally multiple consecutive sentences are missing from our database. Why is this happening? Is it a bug? Most importantly, how can we fix it, or work around it until the bug is fixed?


Hi Kameron,

Hm, that's strange. I can't say I've seen anything like this before.

Could you try setting the instant_submit configuration to true, as explained in this thread? Hopefully that resolves things for you.
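For reference, that setting goes in your `prodigy.json` (or the overrides you pass to the container); with it enabled, each answer is submitted to the database as soon as it's given, instead of waiting for the save button:

```json
{
  "instant_submit": true
}
```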

Hi! I wouldn't necessarily recommend using instant_submit here – this will just add another layer of complexity, especially if you're using multiple sessions. I'd say it's relatively unlikely that it's related to the saving because if anything goes wrong here, you'd immediately see an error in the UI. The success message is shown if the back-end reports that the database accepted the data.

Did you verify that the examples were actually shown to users in the UI? Or are you mostly comparing the input and output data? If you're not sure if the example was presented for annotation in the first place, one thing to look at is the hashes generated for the examples. The most obvious case would be if you have actual duplicates in your sentences (which can easily happen, depending on the data). In that case, both examples will receive the same hash and you'll only see it once.
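One quick way to check for this is to hash the text of each incoming example and count collisions before you start annotating. Here's a minimal sketch using only the standard library (a simplified stand-in for Prodigy's input hashing, not its exact algorithm; `find_duplicates` is a hypothetical helper name):

```python
import hashlib
import json
from collections import Counter

def input_hash(text):
    """Stable hash of the example text (simplified sketch: lowercased and
    stripped, so trivially different copies of a sentence still collide)."""
    return hashlib.md5(text.strip().lower().encode("utf-8")).hexdigest()

def find_duplicates(jsonl_path):
    """Return the texts that occur more than once in the input JSONL file."""
    counts = Counter()
    texts = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            eg = json.loads(line)
            h = input_hash(eg["text"])
            counts[h] += 1
            texts[h] = eg["text"]
    return [texts[h] for h, n in counts.items() if n > 1]
```

If this reports fewer unique texts than lines in your file, the "missing" sentences were most likely deduplicated by Prodigy rather than lost.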

Another thing to try is: split up your data beforehand and start multiple instances (e.g. on different ports). All instances can save to the same dataset, but they're isolated otherwise. In general, this is what we'd recommend if you're easily able to divide your data and don't need to rely on sending out whatever is available next. For instance, if you're working with sentences, it makes a lot of sense to have the same person annotate consecutive sentences – you wouldn't want person A to get sentences 1 to 10, person B to get 11 to 20 and person A to get 21 to 30, and so on. That's unnecessarily disruptive and can more easily lead to mistakes. So it may be a better workflow in general, and will make it a lot easier to reason about what's going on.
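Splitting the data into contiguous blocks beforehand is straightforward. A minimal sketch (assuming one JSON object per line; `split_jsonl` is a hypothetical helper, not part of Prodigy) that you could use to prepare one input file per instance:

```python
import json

def split_jsonl(path, n_parts):
    """Split a JSONL file into n_parts files of consecutive examples,
    so each Prodigy instance serves a contiguous block of sentences."""
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    chunk = -(-len(lines) // n_parts)  # ceiling division
    out_paths = []
    for i in range(n_parts):
        part = lines[i * chunk:(i + 1) * chunk]
        out_path = f"{path}.part{i}.jsonl"
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(part)
        out_paths.append(out_path)
    return out_paths
```

Each output file can then be passed to its own instance on its own port, all saving into the same dataset.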

@ines You were right! They were duplicates! Thanks for that tip.