We're currently developing fairly complex annotation pipelines in Prodigy with @helmiina, in which tasks flow from one recipe to another.
We will have multiple annotators working on various tasks in the pipeline.
Rather than checking each Prodigy instance for available tasks, we were thinking of creating a landing page that would indicate if there are annotation tasks available on each instance.
Do you have any ideas on the best way to implement this? Can we query the Prodigy instances for tasks, or should we poll the database?
If I remember correctly, the task is to show whether there are tasks available for review, right? You have the annotated tasks stored in one dataset, and this dataset is also the input to the review recipe, with the reviewed tasks stored in another dataset.
In any case, you need a non-destructive way to track review availability without consuming the stream, and that means querying the database directly. Specifically, you want to compare the input hashes in the review-input and review-output datasets, something like:
from prodigy.components.db import connect

# review_input_dataset / review_output_dataset are your dataset names (str)
db = connect()
# Count total annotations
total_annotations = db.count_dataset(review_output_dataset)
# Get input hashes from annotation dataset
annotation_hashes = db.get_input_hashes(review_input_dataset)
# Get input hashes of already reviewed items
reviewed_hashes = db.get_input_hashes(review_output_dataset)
# Determine how many items still need review
pending_review = len(set(annotation_hashes) - set(reviewed_hashes))
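For a landing page covering several instances, the same comparison can be wrapped in a small helper. Here's a minimal sketch — the hard-coded hash lists stand in for the results of `db.get_input_hashes()` on each instance's datasets, and the instance names are purely illustrative:

```python
def pending_review(annotation_hashes, reviewed_hashes):
    """Count items that are annotated but not yet reviewed, by input hash."""
    return len(set(annotation_hashes) - set(reviewed_hashes))

def review_status(instances):
    """Build a landing-page summary: instance name -> pending review count.

    `instances` maps a display name to a pair of hash lists, i.e. the
    input hashes of the review-input and review-output datasets for
    that instance.
    """
    return {
        name: pending_review(annotated, reviewed)
        for name, (annotated, reviewed) in instances.items()
    }

# Example with hard-coded hashes standing in for the DB queries:
status = review_status({
    "ner-instance": ([101, 102, 103], [101]),
    "textcat-instance": ([201, 202], [201, 202]),
})
# status == {"ner-instance": 2, "textcat-instance": 0}
```

The landing page would then just render this dict, re-running the queries on each refresh (or on a timer).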
Please note that it's not going to be 100% exact, because one batch of examples will be in the front-end buffer (already consumed and routed, but not yet annotated and saved). To increase the precision, you might want to reduce the batch size.
Another thing you could consider, if you need more precision, is to persist the open tasks (routed but not yet annotated) in a custom database table and add them to the estimation above. That would require your custom router to make a DB transaction (I mention the custom router because I know you use one, and it would be the most convenient place to do it). I'm not sure it's worth it, given that, in theory, this state would be very transitory. I wouldn't even bother to update the state of the opened tasks: if a task is in both the open state and the annotated state (i.e. already in the review output dataset), it can be assumed to be already processed.
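The router-side bookkeeping could look roughly like this. This is only a sketch using a plain `sqlite3` table — the table name, the `record_open_task` helper, and the point at which your custom router calls it are all assumptions, not Prodigy APIs:

```python
import sqlite3

# Use a file path (or your existing DB) in practice; :memory: keeps the
# sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS open_tasks (input_hash INTEGER PRIMARY KEY)"
)

def record_open_task(input_hash):
    """Called from the custom router when a task is handed to an annotator."""
    conn.execute(
        "INSERT OR IGNORE INTO open_tasks (input_hash) VALUES (?)", (input_hash,)
    )
    conn.commit()

def pending_review_estimate(annotation_hashes, reviewed_hashes):
    """Return (pending review count, tasks still open in the front-end).

    As noted above, there is no need to update open-task state: a hash
    that is both "open" and already in the review output dataset is
    assumed processed, so the reviewed set simply wins.
    """
    open_hashes = {row[0] for row in conn.execute("SELECT input_hash FROM open_tasks")}
    pending = set(annotation_hashes) - set(reviewed_hashes)
    still_open = open_hashes - set(reviewed_hashes)
    return len(pending), len(still_open)
```

You'd then add the still-open count on top of the pending-review estimate on the landing page.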
The database API is documented here - let us know if you need help implementing this!
Thanks for the informative answer, as always! Actually, we have been thinking of a landing page that would show an annotator whether there are any annotation tasks available, not just those that should be reviewed!
I suppose if we know the addresses of the Prodigy servers, we could simply connect to the database(s) and query them for unannotated tasks for a specific user?