We are using the update method to save our annotations in an external database after some processing. This usually works fine, but occasionally something happens to the database connection.
Is there a way of failing to save gracefully without losing the attempted batch in the frontend?
Is there a reason why you're using the update method instead of a script that uses db-out? The update method is meant more for logging and for updating a machine learning model during active learning. If there's a database connection hiccup, it may not retry unless you've set that up manually.
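That said, if you do want to keep the external write inside the update callback, you could wrap it in a small retry helper so a short connection hiccup doesn't immediately drop the batch. A rough sketch, where save_batch_to_external_db is a hypothetical placeholder for your own client code:

import time

def save_batch_to_external_db(examples):
    # Hypothetical placeholder: replace with your own code that writes a
    # batch of answered examples to your external database.
    ...

def update_with_retry(examples):
    # Prodigy calls this with the latest batch of answers. Retry the external
    # write a few times before giving up.
    for attempt in range(3):
        try:
            save_batch_to_external_db(examples)
            return
        except Exception:
            if attempt == 2:
                raise  # surface the failure in the logs instead of dropping it silently
            time.sleep(2)

As far as I know you can then return a function like update_with_retry as the "update" component from a custom recipe, but it's still a fallback, not a replacement for Prodigy's own dataset storage.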
I actually thought this was the way to go. In general we deploy prodigy tasks that stream from db and save to db. The idea is that the examples are pulled when Prodigy starts and we save as progress is made. We are using the prodigy.serve method in a Python script inside a container. What would be the recommended way of doing this? Can you point to an example?
In general we deploy prodigy tasks that stream from db and save to db.
What database are you using with Prodigy? Typically, if you don't configure anything then you'll run Prodigy with an SQLite database that stores all the annotations. These annotations can be retrieved via:
prodigy db-out <dataset-name>
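For example, assuming a dataset named my_dataset, you can write everything to a JSONL file like so:

prodigy db-out my_dataset > ./annotations.jsonl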
You can also access all the annotations via Python directly. To quote the docs:
from prodigy.components.db import connect
db = connect()
# To fetch all dataset names
db.datasets
# To get a list of dictionaries with annotations.
examples = db.get_dataset("my_dataset")
You can extend such a script to sync the data with other sources, but Prodigy can also be configured to use an external database directly. It's explained in more detail here. Out of the box Prodigy supports Postgres, MySQL and SQLite.
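For example, pointing Prodigy at a Postgres instance usually comes down to a "db" block in your prodigy.json, roughly like this (the connection details below are placeholders):

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "host": "localhost",
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}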
Yes, we are using the default SQLite. We tried, without luck, to make it work with an external DB (Redshift), and maintaining a separate external DB just for this didn't seem like a good idea.
The other reason we are doing this is observability: we need to keep the progress made by different sessions in sync with our external DB. We have dashboards that show what's been annotated, by whom, and how.
You could explore adding Redshift support by implementing a custom database connection. That could certainly be worth it in the long term, but I don't have enough experience with Redshift to give a good estimate of how hard it would be to implement.
Alternatively, you could sync with a batch job via something like cron. The annotations in Prodigy should have a _timestamp key attached that can be used for comparison with the external DB, so you could upload only the recent annotations that Redshift doesn't have yet.
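A minimal sketch of what such a job could look like, reusing the Python API from above (upload_to_redshift is a hypothetical placeholder for your own client code, and the _timestamp values should be plain Unix timestamps):

from prodigy.components.db import connect

def upload_to_redshift(examples):
    # Hypothetical placeholder: replace with your own Redshift client code.
    ...

def sync_recent_annotations(dataset_name, last_synced_ts):
    # last_synced_ts: the newest _timestamp already present in Redshift
    db = connect()
    examples = db.get_dataset(dataset_name)
    recent = [eg for eg in examples if eg.get("_timestamp", 0) > last_synced_ts]
    if recent:
        upload_to_redshift(recent)
    # Return the newest timestamp seen, so the next run can pick up from here
    return max((eg.get("_timestamp", 0) for eg in recent), default=last_synced_ts)

You'd run something like this periodically from cron and persist the returned timestamp between runs.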