Removing incorrect annotations

trevorwelch · January 7, 2020, 9:23pm

Hi all, and thanks for your continued work on this great tool.

One of our annotators just informed me that she has annotated a few hundred mark tasks incorrectly -- she was rejecting and accepting based on faulty domain knowledge.

What is the best practice for correcting this?

My thought was to stop the Prodigy server we currently have running, export the dataset with db-out, load the resulting JSONL file locally, remove the "answer" from the rows that I know were annotated incorrectly (I have a way to identify these rows), then write the file back out as JSONL and start a new Prodigy instance with a new dataset. I'm afraid this approach will force the annotator to re-annotate every row in the dataset, as I'm not entirely sure how the "memorize" functionality works.

What is the preferred explosion.ai method for approaching this problem?

ines · January 8, 2020, 11:12am

Hi! I think the most straightforward solution would be something like this:

1. Export the full dataset with db-out.
2. Write your logic to find the examples that were annotated incorrectly and save out two files, e.g. correct.jsonl and incorrect.jsonl.
3. Add correct.jsonl to a new dataset; those won't need to be re-annotated.
4. Restart the recipe with incorrect.jsonl as the input data and go through those examples again. At the end of it, you should have a new dataset with the previously correct annotations and the re-annotated (and now also correct) other examples.
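The splitting step above could look roughly like the following. This is only a minimal sketch, not an official recipe: `is_incorrect` and the `meta["batch"]` field it checks are hypothetical stand-ins for whatever rule identifies the faulty rows, and dropping the old "answer" from the rows to be redone is an optional cleanup taken from the original question.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, examples):
    """Write an iterable of dicts as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


def is_incorrect(eg):
    # Hypothetical rule: replace with your own logic for spotting the
    # examples that were annotated with faulty domain knowledge.
    return (eg.get("meta") or {}).get("batch") == "suspect"


def split_annotations(in_path, correct_path, incorrect_path, is_incorrect):
    """Split exported annotations into a keep file and a redo file."""
    correct, incorrect = [], []
    for eg in read_jsonl(in_path):
        if is_incorrect(eg):
            # Optional: drop the old decision so the row goes back in clean.
            incorrect.append({k: v for k, v in eg.items() if k != "answer"})
        else:
            correct.append(eg)
    write_jsonl(correct_path, correct)
    write_jsonl(incorrect_path, incorrect)
    return len(correct), len(incorrect)
```

With placeholder dataset names, the surrounding steps would be something like `prodigy db-out my_dataset > annotations.jsonl`, then `split_annotations("annotations.jsonl", "correct.jsonl", "incorrect.jsonl", is_incorrect)`, then `prodigy db-in my_dataset_fixed correct.jsonl`, and finally restarting the recipe with my_dataset_fixed as the dataset and incorrect.jsonl as the source.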

If this is a common workflow, you could of course automate it with a recipe. And if you're ever in a situation where you can't easily find the wrong annotations and/or you have multiple annotations on the same data that are conflicting, you could use the review recipe. This will show all examples and all available "versions" of the annotations, and lets you go through them again to make a final decision.

If an example that comes in is already present in the dataset, Prodigy will skip it. You can read more about the hashing and underlying principle in the "Loaders and Input Data" section of the Prodigy documentation. If you're following the steps above, it won't matter, because there's likely no overlap between correct.jsonl and incorrect.jsonl.
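To make the skipping behaviour concrete, here is a small illustration of hash-based filtering. It is a sketch of the idea only, not Prodigy's actual implementation: it assumes each example carries the `_task_hash` field that Prodigy adds to annotated examples, and the function name `skip_already_annotated` is made up for this example.

```python
def skip_already_annotated(stream, existing_examples, hash_key="_task_hash"):
    """Yield only the incoming examples whose hash is not in the dataset yet."""
    # Collect the hashes of everything that has already been annotated.
    seen = {eg[hash_key] for eg in existing_examples if hash_key in eg}
    for eg in stream:
        # Examples without a hash are passed through unchanged.
        if eg.get(hash_key) not in seen:
            yield eg
```

The same check also works as a sanity test on the two files: if filtering the examples from incorrect.jsonl against those from correct.jsonl removes nothing, the files don't overlap and none of the examples to be redone will be skipped.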
