Save annotations with update method / Fail gracefully

Hello,

We are using the update method to save our annotations in an external database after some processing. This usually works fine, but occasionally something happens to the database connection.

Is there a way of failing to save gracefully without losing the attempted batch in the frontend?

Thank you very much.

Is there a reason why you're using the update method instead of a script that uses db-out? The update method is meant more for logging and for updating a machine learning model during active learning. If there's a database connection hiccup, it won't retry unless you've set that up manually.
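If you do stick with the update callback, one option is to wrap the write in a retry loop yourself. Here's a minimal sketch, assuming a hypothetical save_to_external_db helper and that your database driver raises ConnectionError when the connection drops:

import time

def save_to_external_db(examples):
    # Hypothetical stand-in for your own external-database write
    ...

def update_with_retry(examples, retries=3, backoff=2.0):
    # Retry a failed batch write instead of dropping it. Raising after
    # the last attempt keeps the failure visible rather than silently
    # losing annotations.
    for attempt in range(1, retries + 1):
        try:
            save_to_external_db(examples)
            return
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

You'd then return update_with_retry under the "update" key of your recipe's components dictionary.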

I actually thought this was the way to go. In general we deploy Prodigy tasks that stream from a DB and save to a DB. The idea is that the examples are pulled when Prodigy starts and we save as progress is made. We are using the prodigy.serve method in a Python script inside a container. What would be the recommended way of doing this? Can you point to an example?

Thank you very much.

In general we deploy Prodigy tasks that stream from a DB and save to a DB.

What database are you using with Prodigy? Typically, if you don't configure anything then you'll run Prodigy with an SQLite database that stores all the annotations. These annotations can be retrieved via:

prodigy db-out <dataset-name>

You can also access all the annotations via Python directly. To quote the docs:

from prodigy.components.db import connect

db = connect()

# To fetch all dataset names
db.datasets

# To get a list of dictionaries with annotations. 
examples = db.get_dataset("my_dataset")

You can extend such a script to sync the data with other sources, but Prodigy can also be configured to use an external database directly. It's explained in more detail here. Out of the box Prodigy supports Postgres, MySQL and SQLite.
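For example, the same connect helper also accepts a database ID and a settings dictionary, mirroring the "db" and "db_settings" entries you'd put in prodigy.json (the credentials below are placeholders):

from prodigy.components.db import connect

# Same values you'd otherwise put under "db" and "db_settings"
# in prodigy.json; the credentials here are placeholders
db = connect(
    "postgresql",
    {"dbname": "prodigy", "user": "annotator", "password": "example"},
)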

Again thank you for taking the time to answer.

Yes, we are using the default SQLite. We tried, with no luck, to make it work with an external DB (Redshift), and maintaining a different external DB just for this didn't seem like a good idea.

The other reason we are doing this is observability: we need to keep the progress made by different sessions in sync with our external DB. We have dashboards showing what's been annotated, by whom, and how.

I understand that saving annotations in the update method could be causing other issues for us (repeated examples across sessions even with feed_overlap = False).

Would you say that using some kind of scheduling library in Python, together with the programmatic way of accessing the datasets, could be the way to go?

Thank you.

You could explore supporting Redshift by implementing a custom database connection. That could certainly be worth it in the long term, but I don't have enough experience with Redshift to give a good estimate of how hard it might be to implement.

Alternatively, you could sync with a batch job via something like cron. The annotations in Prodigy should have a _timestamp key attached that can be used for comparison with the DB. That way, you'd only upload recent annotations that Redshift doesn't have yet.
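A minimal sketch of that batch job, assuming you pass in your own Redshift upload function and persist the last synced timestamp somewhere:

from prodigy.components.db import connect

def sync_new_annotations(dataset, last_synced, upload):
    # Push only annotations newer than the last synced _timestamp;
    # `upload` is a stand-in for your own Redshift insert logic
    db = connect()  # default settings, i.e. the local SQLite database
    examples = db.get_dataset(dataset)
    new = [eg for eg in examples if eg.get("_timestamp", 0) > last_synced]
    if new:
        upload(new)
    # Return the newest timestamp seen so the caller can persist it
    return max((eg.get("_timestamp", 0) for eg in examples), default=last_synced)

Run from cron every few minutes, that keeps Redshift close to real time without touching the annotation flow.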

This sounds like the way to go. Thank you.
