I'm writing to check my understanding of the review recipe, because it's doing something I didn't anticipate, though I'm not sure whether it's actually unintended:
I bootstrapped a dataset with match patterns for one label, trained an NER model for that label, and then used ner.manual and ner.correct to annotate additional examples. We added to the raw text corpus over time, so I believe we ended up re-annotating certain texts, and I also wanted to put a review stage into practice just to confirm our initial annotations.
I anticipated that the review recipe would enforce uniqueness on input hashes, but that doesn't seem to be the case, and the progress bar is showing that we've annotated 114% of the examples in our source dataset!
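In case it's useful, here's roughly how I'm confirming the duplicates on my end: a minimal sketch using Prodigy's database API, assuming the `_input_hash` field Prodigy sets on each example (the dataset name is a placeholder, and I believe `get_dataset` may be called `get_dataset_examples` in newer versions):

```python
from collections import Counter

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ner_reviewed")  # placeholder dataset name

# Count how many annotations share the same input hash, i.e. the same raw text
input_hashes = Counter(eg["_input_hash"] for eg in examples)
dupes = {h: n for h, n in input_hashes.items() if n > 1}
print(f"{len(examples)} examples, {len(input_hashes)} unique input hashes, "
      f"{len(dupes)} hashes annotated more than once")
```

This is what's making me think we really do have repeated inputs rather than a progress-bar quirk.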
- Is something "going wrong" with the dataset output by the review recipe, or is it expected behavior that we may have duplicate input hashes in the dataset?
- I'm actually fine with going through the source dataset until I exhaust the stream; it's still a good exercise even if it results in a bunch of input hash dupes. Am I correct in thinking that data-to-spacy will dedupe on input hash, combining non-conflicting annotations, and that the review recipe ensures the annotations in the dataset are non-conflicting? Or should I set up a post-processing check to confirm that annotations on the same input don't conflict (I've sketched what I have in mind below)?
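If a check like that is warranted, this is the kind of post-processing I had in mind: group accepted examples by input hash and flag any groups whose spans disagree. Just a sketch under my assumptions (placeholder dataset name again, and I'm only comparing `start`/`end`/`label` on the spans):

```python
from collections import defaultdict

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ner_reviewed")  # placeholder dataset name

# Group accepted answers by input hash, reducing each example's spans
# to a comparable set of (start, end, label) tuples
by_input = defaultdict(list)
for eg in examples:
    if eg.get("answer") == "accept":
        spans = frozenset(
            (s["start"], s["end"], s["label"]) for s in eg.get("spans", [])
        )
        by_input[eg["_input_hash"]].append(spans)

conflicts = {h: sets for h, sets in by_input.items() if len(set(sets)) > 1}
print(f"{len(conflicts)} input hashes with conflicting span annotations")
```

But if review plus data-to-spacy already cover this, I'd rather not maintain it myself.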
Thanks as always for your help and insights!