We're currently developing fairly complex annotation pipelines in Prodigy with @helmiina, in which tasks flow from one recipe to another.
We will have multiple annotators working on various tasks in the pipeline.
Rather than checking each Prodigy instance for available tasks, we were thinking of creating a landing page that would indicate if there are annotation tasks available on each instance.
Do you have any ideas on the best way to implement this? Can we query the Prodigy instances for tasks, or should we poll the database?
If I remember correctly, the task is to show whether there are tasks available for review, right? You have the annotated tasks stored in one dataset, and this dataset is also the input to the review recipe, with the reviewed tasks stored in another dataset.
In any case, you need a non-destructive way to track review availability without consuming the stream, and that means querying the database directly. Specifically, you want to compare the input hashes in the review-input and review-output datasets, something like:
from prodigy.components.db import connect

# review_input_dataset / review_output_dataset are your dataset names (str)
db = connect()
# Count total annotations
total_annotations = db.count_dataset(review_output_dataset)
# Get input hashes from annotation dataset
annotation_hashes = db.get_input_hashes(review_input_dataset)
# Get input hashes of already reviewed items
reviewed_hashes = db.get_input_hashes(review_output_dataset)
# Determine how many items still need review
pending_review = len(set(annotation_hashes) - set(reviewed_hashes))
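For a landing page covering several instances, the same comparison can be wrapped in a small helper. Here's a minimal sketch — the hard-coded hash lists stand in for the results of `db.get_input_hashes()` on each instance's datasets, and the instance names are purely illustrative:

```python
def pending_review(annotation_hashes, reviewed_hashes):
    """Count items that are annotated but not yet reviewed, by input hash."""
    return len(set(annotation_hashes) - set(reviewed_hashes))

def review_status(instances):
    """Build a landing-page summary: instance name -> pending review count.

    `instances` maps a display name to a pair of hash lists, i.e. the
    input hashes of the review-input and review-output datasets for
    that instance.
    """
    return {
        name: pending_review(annotated, reviewed)
        for name, (annotated, reviewed) in instances.items()
    }

# Example with hard-coded hashes standing in for the DB queries:
status = review_status({
    "ner-instance": ([101, 102, 103], [101]),
    "textcat-instance": ([201, 202], [201, 202]),
})
# status == {"ner-instance": 2, "textcat-instance": 0}
```

The landing page would then just render this dict, re-running the queries on each refresh (or on a timer).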
Please note that it's not going to be 100% exact, because one batch of examples will be in the front-end buffer (already consumed and routed, but not yet annotated and saved). To increase the precision, you might want to reduce the batch size.
Another thing you could consider, if you need more precision, is to persist the open tasks (routed but not yet annotated) in a custom database table and add them to the estimation above. That would require your custom router to make a DB transaction (I mention the custom router because I know you use one, and it would be the most convenient place to do it). I'm not sure it's worth it, given that, in theory, this state would be very transitory. I wouldn't even bother to update the state of the opened tasks: if a task is in both the open state and the annotated state (i.e. already in the review output dataset), it can be assumed to be already processed.
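The router-side bookkeeping could look roughly like this. This is only a sketch using a plain `sqlite3` table — the table name, the `record_open_task` helper, and the point at which your custom router calls it are all assumptions, not Prodigy APIs:

```python
import sqlite3

# Use a file path (or your existing DB) in practice; :memory: keeps the
# sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS open_tasks (input_hash INTEGER PRIMARY KEY)"
)

def record_open_task(input_hash):
    """Called from the custom router when a task is handed to an annotator."""
    conn.execute(
        "INSERT OR IGNORE INTO open_tasks (input_hash) VALUES (?)", (input_hash,)
    )
    conn.commit()

def pending_review_estimate(annotation_hashes, reviewed_hashes):
    """Return (pending review count, tasks still open in the front-end).

    As noted above, there is no need to update open-task state: a hash
    that is both "open" and already in the review output dataset is
    assumed processed, so the reviewed set simply wins.
    """
    open_hashes = {row[0] for row in conn.execute("SELECT input_hash FROM open_tasks")}
    pending = set(annotation_hashes) - set(reviewed_hashes)
    still_open = open_hashes - set(reviewed_hashes)
    return len(pending), len(still_open)
```

You'd then add the still-open count on top of the pending-review estimate on the landing page.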
The database API is documented here - let us know if you need help implementing this!
Thanks for the informative answer, as always! Actually, we have been thinking of a landing page that would show an annotator whether there are any annotation tasks available, not just those that should be reviewed!
I suppose if we know the addresses of the Prodigy servers, we could simply connect to the database(s) and query them for unannotated tasks for a specific user?