I'm writing to check my understanding of the review recipe, because it's doing something unanticipated but I'm not sure whether it's unintended:
I bootstrapped a dataset with match patterns for one label, trained an NER model for that label, and then used ner.manual and ner.correct to annotate additional examples. We added to the corpus of raw text over time, so I believe we re-annotated certain texts, and I also wanted to put a review stage into practice just to confirm our initial annotations.
I anticipated that the review recipe would enforce uniqueness on input hashes, but that doesn't seem to be the case, and the progress bar shows that we've annotated 114% of the examples from our source dataset!
Is something "going wrong" with the dataset output by the review recipe, or is it expected behavior that we may have duplicate input hashes in the dataset?
I'm actually fine with going through the source dataset until I exhaust the stream-- it's still a good exercise even if it results in a bunch of input hash dupes. Am I correct in thinking that data-to-spacy will dedupe on input hashes, combining non-conflicting annotations, and that the review recipe ensures the annotations in the dataset are non-conflicting? Or should I set up some post-processing hooks to confirm that annotations on the same input do not conflict?
Hi! Are you using the latest version of Prodigy? I remember an issue in the past where the review recipe would report progress incorrectly – but this should have been fixed by now.
If your annotations were generated with a manual UI, the review recipe will merge all annotations with the same input hash, so you should only ever see the same text once, together with all available versions of annotations created for the given text. Based on these, you can then create a single correct answer. By default, you'll see every example, even if all annotators agree (because in theory, they could all be wrong).
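To make that grouping concrete, here's a minimal sketch (with made-up text and spans) showing why two annotators' versions of the same text end up bundled together: they share an input hash but get different task hashes.

```python
from prodigy import set_hashes

# Two versions of the same text: identical input, different span annotations
eg1 = {"text": "Apple is based in Cupertino.",
       "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
eg2 = {"text": "Apple is based in Cupertino.",
       "spans": [{"start": 18, "end": 27, "label": "GPE"}]}

eg1 = set_hashes(eg1)
eg2 = set_hashes(eg2)

assert eg1["_input_hash"] == eg2["_input_hash"]  # same underlying text
assert eg1["_task_hash"] != eg2["_task_hash"]    # different annotations
```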
Yes, data-to-spacy will merge annotations on the same texts – including annotations for different components (e.g. NER, text classification, dependencies etc.). However, if it comes across actual conflicts, it will have to discard all of the conflicting versions except one (and it obviously can't know which one is correct). That's where the review workflow comes in: it lets you double-check your annotations and decide how to resolve conflicts. So a workflow could look like this (there's a command sketch after the list):
Collect annotations with some overlap for a given task, e.g. named entities.
Run the review workflow with all NER datasets, resolve all conflicts and create a new final dataset with the correct version of all annotations.
Optional: Repeat for other tasks like textcat if needed. Also resolve any potential problems that were surfaced, e.g. talk to an annotator who misunderstood the annotation scheme.
Run data-to-spacy with the reviewed datasets to create a single, conflict-free training corpus.
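As a rough sketch of what those steps can look like on the command line (the dataset names here are placeholders, and the data-to-spacy flags are from memory of the v1.11 syntax, so double-check them against the --help output for your version):

```
# Review two NER datasets into a single, resolved dataset
prodigy review ner_reviewed ner_set_one,ner_set_two

# Export the reviewed dataset to a training corpus (v1.11-style syntax)
prodigy data-to-spacy ./corpus --ner ner_reviewed --eval-split 0.2 --lang en
```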
Hello! Apologies for my delay in response here-- as always, appreciate your time and attention!
I am using the latest beta, yes: 1.11.0a7.
Your response helped me clarify the situation I'm running into. I am reviewing two datasets into a gold dataset, but updating that gold reviewed dataset with new annotations from the "live" dataset as they come in.
correct_ds is annotated with ner.correct, so the manual UI-- this is the one we're actively using to collect annotations
teach_ds is from ner.silver-to-gold, so also the manual UI
gold_ds is the reviewed gold dataset
We're not updating teach_ds, but we regularly add to correct_ds. I want to resolve any conflicts between new annotations in correct_ds and existing annotations in teach_ds.
Is it reasonable to repeatedly do a new batch of annotations in correct_ds and then review correct_ds and teach_ds into gold_ds? I'm assuming that any new annotations in correct_ds that overlap with annotations in teach_ds will be presented with the examples from both datasets, resulting in one true annotation per hash in gold_ds.
Or is it the case that once an example from teach_ds has been reviewed during the initial creation of gold_ds, it will not be presented again by the review recipe, potentially resulting in conflicting annotations in gold_ds?
In thinking all this through, I realize I've been using the review recipe throughout the life of annotation collection rather than once at the end, and perhaps that's not the intended use case! Maybe the way to go is to merge correct_ds and teach_ds first, review the resulting merged dataset, and then ner.correct into merged and review merged again, assuming ner.correct checks for the input_hash in the dataset before presenting examples.
Thanks for the details, this definitely makes sense. I guess the main question is: Are you expecting more annotations for an already reviewed unique example to come in after you've already reviewed it? For instance, that you have example 123 and 3 different annotations for it, and then later on, another annotator creates another version of it?
In that case, you should end up seeing example 123 again, this time with 4 versions, and Prodigy should consider it a different question (because you're asked to review 4 versions and not 3). If that's not the case, that's not ideal and we should fix it by making the review stream generate unique hashes that take the "versions" into account (if you need that, you can probably hack it in by editing the make_eg method in review.py).
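If you do end up needing that, the idea would be something along these lines – a sketch only, wrapping the merged stream rather than editing make_eg directly, and assuming the merged review examples carry a "versions" list:

```python
from prodigy import set_hashes

def rehash_with_versions(stream):
    """Make the task hash also depend on how many versions an example has."""
    for eg in stream:
        # Record the version count on the task so it can feed into the hash
        eg["n_versions"] = len(eg.get("versions", []))
        yield set_hashes(eg, task_keys=("spans", "label", "n_versions"),
                         overwrite=True)
```

That way, example 123 with 3 versions and example 123 with 4 versions get different task hashes, so the second round isn't excluded as already answered.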
The only potential problem that can always occur is that your final and reviewed answer differs between reviews – e.g. if you make a typo or misclick, or just make a different decision. In that case, you end up with conflicting examples in gold_ds. It should be pretty rare, though, and also quite easy to check for programmatically.
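For example, a check along these lines would surface those cases – a sketch that assumes the reviewed dataset is called gold_ds and treats two accepted answers with different spans for the same input hash as a conflict:

```python
from collections import defaultdict
from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
by_input = defaultdict(set)

for eg in db.get_dataset("gold_ds"):
    if eg.get("answer") == "accept":
        # Summarize each annotation as a hashable tuple of its spans
        spans = tuple(sorted((s["start"], s["end"], s["label"])
                             for s in eg.get("spans", [])))
        by_input[eg["_input_hash"]].add(spans)

conflicts = {h: v for h, v in by_input.items() if len(v) > 1}
print(f"{len(conflicts)} input hash(es) with conflicting accepted annotations")
```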
So just to check my understanding, in this post you note that "by default, Prodigy will skip examples that are already in the dataset you're saving to: so if you've already reviewed the same example before, you won't be asked about it again." But the review recipe considers an example 123 (input hash xyz) with three conflicting annotations as distinct from example 123 (input hash xyz) with four conflicting annotations, and so the recipe will serve example 123 with four annotations, even though input hash xyz is already in the dataset?
I swear I'm not trying to belabor this-- I should probably just look at the code!
No worries! This type of stuff is a bit abstract and not always easy to describe and explain either.
By default, the exclusion is performed based on the task hash, not the input hash (but this can be configured via the exclude_by setting in the recipes, as you may want to use different strategies for different tasks). The hashes are the source of truth here and what Prodigy looks at to determine whether two examples are identical – nothing else. When you're generating the hashes, you can include the "versions" as one of the relevant keys for the task hash, so example 123 + 3 versions and example 123 + 4 versions both end up with the same input hash, but different task hashes.
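For reference, here's roughly what that looks like in a custom recipe – the recipe name and the single inline example are placeholders just to keep the sketch self-contained:

```python
import prodigy
from prodigy import set_hashes

@prodigy.recipe(
    "ner-review-by-input",  # placeholder name
    dataset=("Dataset to save annotations to", "positional", None, str),
)
def ner_review_by_input(dataset):
    # Placeholder stream with one made-up example
    stream = [set_hashes({"text": "Apple is based in Cupertino.", "spans": []})]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {
            "labels": ["ORG", "GPE"],
            # "task" (the default) excludes by task hash, "input" by input hash
            "exclude_by": "input",
        },
    }
```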