Manual Annotation Dataset limit


I am doing manual annotations for a classifier. I want to set a threshold on the number of "Accepted" annotations.

For example: once we have 100 accepted samples, the annotation can end. I am not training a classifier in the loop; it's purely manual annotation.

Hi! In Prodigy v1.10, you could implement something like that using the validate_answer callback. It's not 100% what that function was originally designed to do, but it should work :) The annotator will then see an alert when 100 accepted answers are submitted and won't be able to submit any more.

To count the existing accepted answers, you could use the update callback, which gives you access to the batches of annotations that come back to the server. Here's a simple example:

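total_accepted = 0

def update(answers):
    # The counter lives at module level, so it has to be declared global
    # before it can be reassigned inside the function
    global total_accepted
    total_accepted += len([eg for eg in answers if eg["answer"] == "accept"])

def validate_answer(eg):
    # Raising an error here shows an alert in the app and blocks the answer
    if eg["answer"] == "accept" and total_accepted >= 100:
        raise ValueError("Enough accepted answers, you can stop :)")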

One thing to keep in mind is that depending on the batch size, there may be a small delay until the annotator sees the alert, because the batches of answers first have to be sent back to the server until Prodigy can know that 100 annotations are there. You could minimise that by using a lower batch size or setting "instant_submit": True to immediately submit each answer as it's made in the app.
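If you'd rather avoid a module-level global, here is a minimal sketch of the same idea using a closure, so both callbacks share one counter. The helper name make_limit_callbacks and its arguments are made up for illustration and are not part of Prodigy; the commented-out part at the end assumes a custom recipe that returns its components as a dictionary.

```python
def make_limit_callbacks(limit=100, initial_accepted=0):
    """Build update / validate_answer callbacks that share one counter.

    Hypothetical helper for illustration, not a Prodigy function.
    """
    state = {"accepted": initial_accepted}

    def update(answers):
        # Receives each batch of answers that comes back to the server
        state["accepted"] += sum(1 for eg in answers if eg.get("answer") == "accept")

    def validate_answer(eg):
        # Raising an error shows an alert in the app and blocks the answer
        if eg.get("answer") == "accept" and state["accepted"] >= limit:
            raise ValueError("Enough accepted answers, you can stop :)")

    return update, validate_answer, state


update, validate_answer, state = make_limit_callbacks(limit=100)

# Assumed usage inside a custom recipe (other components omitted):
# return {
#     "update": update,
#     "validate_answer": validate_answer,
#     "config": {"instant_submit": True},
# }
```

The initial_accepted argument is there so the counter can start from a number other than zero, for example if some accepted answers already exist.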


Thanks @Ines.

This works. A quick follow-up question: can I also include the Accept, Reject and Ignore answers from previous annotations, made before this check was added, and compute the total?

Maybe my question wasn't clear. I need the total counts across multiple sessions, so the counts should basically be drawn from the answers in the db, not just from the current annotations.


In that case, you could connect to the database and then call db.get_dataset to load a dataset and/or session to pre-populate your counts. You don't want to do that within the validate_answer callback, because otherwise the dataset would be loaded and the counts re-computed every time a user submits an answer. If you do expect the dataset to change as the user annotates, you could update the counts periodically so it's faster and less expensive to compute.
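A rough sketch of what pre-populating the counts could look like. The helper names count_answers and load_counts and the dataset name are placeholders, and the `or []` fallback is a defensive assumption in case nothing is returned for a dataset that doesn't exist yet.

```python
from collections import Counter


def count_answers(examples):
    """Tally the "answer" values (accept / reject / ignore) of saved examples."""
    return Counter(eg["answer"] for eg in examples if "answer" in eg)


def load_counts(dataset_name):
    """Load a dataset once and count its answers (hypothetical helper)."""
    # Imported here so count_answers can be used without Prodigy installed
    from prodigy.components.db import connect

    db = connect()  # uses the database settings from your prodigy.json
    examples = db.get_dataset(dataset_name) or []
    return count_answers(examples)


# Assumed usage, once when the recipe starts (dataset name is a placeholder):
# counts = load_counts("my_dataset")
# total_accepted = counts["accept"]
```

Calling this once at startup and then incrementing the number in the update callback keeps the database out of the per-answer path.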