Removing incorrect annotations

Hi all, and thanks for your continued work on this great tool.

One of our annotators just informed me that she has annotated a few hundred mark tasks incorrectly -- she was rejecting and accepting based on faulty domain knowledge.

What is the best practice for correcting this?

My thought was to stop the Prodigy server we currently have running, export the dataset with db-out, load the resulting JSONL file locally, and remove the "answer" from the rows that I know were annotated incorrectly (I have a way to identify these rows). I would then export the file to JSONL again and start a new Prodigy instance with a new Prodigy dataset. I'm afraid this approach will force the annotator to re-annotate every row in the dataset, as I'm not entirely sure how the "memorize" functionality works.

What is the preferred method for approaching this problem?

Hi! I think the most straightforward solution would be something like this:

  1. Export the full dataset with db-out.
  2. Write your logic to find the examples that were annotated incorrectly and save out two files, e.g. correct.jsonl and incorrect.jsonl.
  3. Add correct.jsonl to a new dataset – those won't need to be re-annotated.
  4. Restart the recipe with incorrect.jsonl as the input data and go through those examples again. At the end of it, you should have a new dataset with the previously correct annotations and the re-annotated (and now also correct) other examples.
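Step 2 can be sketched in plain Python. This is only a minimal sketch under a few assumptions: the db-out export is one JSON object per line, and is_incorrect is a hypothetical placeholder (the "meta"/"topic" check is invented for illustration) that you would replace with your own rule for spotting the bad annotations. Dropping the old "answer" from the rows to be redone is optional and done here only to keep the input file clean.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, records):
    """Write records to path, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def is_incorrect(record):
    """Hypothetical rule: replace with your own way of finding bad rows."""
    return record.get("meta", {}).get("topic") == "faulty-domain"


def split_annotations(records, predicate=is_incorrect):
    """Return (correct, incorrect) lists; the input records are not modified."""
    correct, incorrect = [], []
    for record in records:
        if predicate(record):
            # Optional: drop the old decision so the row is annotated afresh.
            incorrect.append({k: v for k, v in record.items() if k != "answer"})
        else:
            correct.append(record)
    return correct, incorrect


def split_file(in_path, correct_path="correct.jsonl", incorrect_path="incorrect.jsonl"):
    """Read an exported dataset and write the two files used in steps 3 and 4."""
    correct, incorrect = split_annotations(read_jsonl(in_path))
    write_jsonl(correct_path, correct)
    write_jsonl(incorrect_path, incorrect)
    return len(correct), len(incorrect)
```

Calling split_file("annotations.jsonl") on the exported file (the file name is an assumption) would produce correct.jsonl and incorrect.jsonl. The first can then be added to the new dataset, for example with db-in, and the second used as the input for the recipe.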

If this is a common workflow, you could of course automate it with a recipe. And if you're ever in a situation where you can't easily find the wrong annotations and/or you have multiple annotations on the same data that are conflicting, you could use the review recipe. This will show all examples and all available "versions" of the annotations, and lets you go through them again to make a final decision.

If an example that comes in is already present in the dataset, Prodigy will skip it. You can read more about hashing and the underlying principle in the Prodigy documentation. If you're following the steps above, this won't matter, because there's likely no overlap between correct.jsonl and incorrect.jsonl.
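If you'd like to confirm that the two files really don't overlap before starting the server, a quick check along these lines may help. It's a rough sketch that assumes each exported record carries a "_task_hash" key, as Prodigy's exports normally do; records without one are simply ignored, and the function names are made up for this example.

```python
def task_hashes(records):
    """Collect the "_task_hash" values found in a list of records."""
    return {r["_task_hash"] for r in records if "_task_hash" in r}


def overlapping_hashes(first, second):
    """Return the task hashes that appear in both lists of records."""
    return task_hashes(first) & task_hashes(second)
```

An empty result means none of the examples to be re-annotated share a task hash with the examples already in the new dataset, so none of them should be skipped as duplicates.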