Review dataset with multiple input hashes

Hello!

I'm writing to check my understanding of the review recipe, because it's doing something I didn't anticipate, though I'm not sure it's actually unintended:

I bootstrapped a dataset with match patterns for one label, trained an NER model for that label, and then used ner.manual and ner.correct to annotate additional examples. We added to the corpus of raw text over time, so I believe we re-annotated certain texts, and I also wanted to put a review stage into practice to double-check our initial annotations.

I expected the review recipe to enforce uniqueness on input hashes, but that doesn't seem to be the case, and the progress bar shows that we've annotated 114% of the examples from our source dataset!

  1. Is something "going wrong" with the dataset output by the review recipe, or is it expected behavior that we may have duplicate input hashes in the dataset?
  2. I'm actually fine with going through the source dataset until I exhaust the stream; it's still a good exercise even if it results in a bunch of duplicate input hashes. Am I correct in thinking that data-to-spacy will dedupe input hashes and combine non-conflicting annotations, and that the review recipe ensures the annotations in the dataset are non-conflicting? Or should I set up some post-processing to confirm that annotations in the same dataset don't conflict?

Thanks as always for your help and insights!

Adam

Hi! Are you using the latest version of Prodigy? I remember an issue in the past where the review recipe would report progress incorrectly – but this should have been fixed by now.

If your annotations were generated with a manual UI, the review recipe will merge all annotations with the same input hash, so you should only ever see the same text once, together with all available versions of annotations created for the given text. Based on these, you can then create a single correct answer. By default, you'll see every example, even if all annotators agree (because in theory, they could all be wrong).
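If you want to double-check what's actually in the dataset, one quick option is to count the `_input_hash` values in an export. This is just a minimal sketch, assuming you've exported the dataset with something like `prodigy db-out your_review_dataset > review.jsonl` (the dataset name and file path are placeholders):

```python
import json
from collections import Counter

# Count how many annotation examples share each input hash in a
# Prodigy JSONL export (e.g. created with `prodigy db-out ...`).
counts = Counter()
with open("review.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        counts[example["_input_hash"]] += 1

total = sum(counts.values())
unique = len(counts)
duplicates = {h: n for h, n in counts.items() if n > 1}
print(f"{total} examples, {unique} unique input hashes")
print(f"{len(duplicates)} input hashes appear more than once")
```

If the number of unique input hashes matches the number of source texts, the dataset itself is fine and the odd percentage is only a progress reporting issue.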

Yes, data-to-spacy will merge annotations on the same texts – including annotations for different components (e.g. NER, text classification, dependencies etc.). However, if it comes across actual conflicts, it will have to discard all of the conflicting versions except one (and it obviously can't know what the correct answer is). That's where the review workflow comes in: it lets you double-check your annotations, and decide how to resolve conflicts. So a workflow could look like this:

  1. Collect annotations with some overlap for a given task, e.g. named entities.
  2. Run the review workflow with all NER datasets, resolve all conflicts and create a new final dataset with the correct version of all annotations.
  3. Optional: Repeat for other tasks like textcat if needed. Also resolve any potential problems that were surfaced, e.g. talk to an annotator who misunderstood the annotation scheme.
  4. Run data-to-spacy with the reviewed datasets to create a single, conflict-free training corpus.
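If you'd still like a post-processing safety net before step 4, here's a rough sketch of the kind of check you could run over a JSONL export of the reviewed dataset. It assumes you compare NER annotations by their (start, end, label) tuples and that the file name is a placeholder:

```python
import json
from collections import defaultdict

def span_signature(example):
    # Reduce an example's NER annotation to a comparable set of
    # (start, end, label) tuples; a missing "spans" key means no entities.
    return frozenset(
        (s["start"], s["end"], s["label"]) for s in example.get("spans", [])
    )

# Group accepted examples by input hash and flag any hashes whose
# annotation versions still disagree.
versions = defaultdict(set)
with open("reviewed_ner.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        if example.get("answer") == "accept":
            versions[example["_input_hash"]].add(span_signature(example))

conflicts = {h: sigs for h, sigs in versions.items() if len(sigs) > 1}
print(f"{len(conflicts)} input hashes have conflicting span annotations")
```

If that prints zero conflicts, data-to-spacy should be able to merge everything cleanly without having to discard any versions.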