I have an EC2 instance that runs Prodigy sessions where multiple users annotate (each with a unique session ID), and the annotated examples are saved to a remote host. I've set "feed_overlap": false and "exclude_by": "input" in the config so that we don't get duplicates in our datasets.
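For reference, this is roughly where those settings live in our setup (a simplified sketch of the recipe components, not the actual recipe; names are placeholders):

components = {
    "dataset": "ner_corrections",   # placeholder dataset name
    "view_id": "ner_manual",
    # "stream": ...,                # generator of tasks, omitted here
    "config": {
        "feed_overlap": False,      # don't send every example to every named session
        "exclude_by": "input",      # skip examples whose _input_hash has been annotated before
    },
}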
We just corrected a batch of over 1200 examples and found over 300 duplicates. Some of these had been annotated multiple times by different users; in other cases the same user had annotated the same example more than once. I dug through some of our other datasets and found a similar trend.
I am using a custom ner.correct recipe with a custom sentence-splitting implementation, following this thread: https://support.prodi.gy/t/split-sents-threshold-setting-not-working-with-custom-ner-correct/3001/8. Here is the code for it:
import copy

from prodigy import set_hashes


def split_sentences(stream, nlp, frag_length, hash_type="_input_hash"):
    '''Yields tasks for tasks in stream. Input task texts are split into smaller samples,
    but assigned their input hash before being split. This produces a one-to-many mapping
    of hashes to samples.
    '''
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        orig_eg = copy.deepcopy(orig_eg)
        # Get input hash for the entire text of this sample, before we process and split it.
        # This avoids duplicates in our datasets. Processing/splitting techniques can change;
        # if we hash after the fact, we may get duplicates.
        hash = set_hashes(orig_eg, overwrite=True)[hash_type]
        sents = ""
        for i, sent in enumerate(doc.sents):
            eg = copy.deepcopy(orig_eg)
            if i > 0:
                sents += " " + sent.text
            else:
                sents += sent.text
            if len(sents) > frag_length:  # Send out this example, and start rebuilding the next one
                eg["text"] = copy.deepcopy(sents)
                # Set other hashes (for example, task hash)
                eg = set_hashes(eg, overwrite=True)
                # Manually set hash (of the type we care to track) to hash for complete sample.
                eg[hash_type] = hash
                sents = ""
                yield eg
        eg["text"] = copy.deepcopy(sents)
        eg = set_hashes(eg, overwrite=True)
        eg[hash_type] = hash
        yield eg
Most of the time my input hashing technique works, so my hunch is that the hashing logic itself is sound.
Hi! To make it easier to investigate this, which version of Prodigy are you using? (And were you always using the same version during the annotation process or did you upgrade in between?)
Hey! We're on 1.10.3. We've upgraded a few times since starting out with Prodigy, and this issue has been around for most, if not all, of those versions.
The annotators and I also noticed that redundant samples (within the same session) are being served without any pre-highlighted entity spans: the first sample is tagged, the rest aren't.
Okay, then at least it's unlikely to be caused by any recent change. If you look at the hashes of the duplicate examples, do they have different hashes?
I haven't run your sentence segmentation code yet, but one thing I noticed: your logic yields twice, and the second yield is not behind a conditional. yield doesn't end the function the way return does, so if the conditional is met on the last sentence, eg is sent out twice:
for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
    # ...
    for i, sent in enumerate(doc.sents):
        # ...
        if len(sents) > frag_length:
            # ...
            yield eg  # <-- first time
    yield eg          # <-- second time
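To illustrate the general point with a standalone toy example (not your recipe):

def numbers():
    for i in range(3):
        if i == 2:
            yield i   # first yield, inside the conditional
    yield i           # runs unconditionally after the loop

print(list(numbers()))  # [2, 2] – the last value goes out twice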
I think you were right...there was something with that second yield. It didn't make sense to me at first, because that should only yield remaining sentences (ones that did not fall into a prior segment). If there are none, it should yield an example with {"text":""}. But that wasn't happening...it was yielding an example with the full text of the prior segment.
I added a condition, where that second yield only executes if there are left-over sentences, and it works.
def split_sentences(stream, nlp, frag_length, hash_type="_input_hash"):
    '''Yields tasks for tasks in stream. Input task texts are split into smaller samples,
    but assigned their input hash before being split. This produces a one-to-many mapping
    of hashes to samples.
    '''
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        # Get input hash for the entire text of this sample, before we process and split it.
        # This avoids duplicates in our datasets. Processing/splitting techniques can change;
        # if we hash after the fact, we may get duplicates.
        hash = set_hashes(orig_eg, overwrite=True)[hash_type]
        sents = ""
        for i, sent in enumerate(doc.sents):
            eg = copy.deepcopy(orig_eg)
            if i > 0:
                sents += " " + sent.text
            else:
                sents += sent.text
            if len(sents) > frag_length:  # Send out this example, and start rebuilding the next one
                eg["text"] = sents
                # Set other hashes (for example, task hash)
                eg = set_hashes(eg, overwrite=True)
                # Manually set hash (of the type we care to track) to hash for complete sample.
                eg[hash_type] = hash
                sents = ""
                yield eg
        if len(sents) > 0:
            # There are remaining sentences in the example to send out
            eg["text"] = sents
            eg = set_hashes(eg, overwrite=True)
            eg[hash_type] = hash
            yield eg
Sorry, I spoke too soon. Our annotators worked on the EC2 instance (with the updated sentence segmentation code) in a multi-user session, and I'm seeing lots of duplicates in the dataset. Some were annotated redundantly by multiple users, sometimes by the same user.
Maybe log and inspect the examples that go out and their hashes?
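Something along these lines should work (a rough sketch, assuming the stream is a generator of task dicts that already have their hashes set):

def log_hashes(stream):
    # Wrap the stream and print the hashes of every task that goes out to the app.
    for eg in stream:
        print(eg.get("_input_hash"), eg.get("_task_hash"), repr(eg["text"][:50]))
        yield eg

# in the recipe, e.g.:
# stream = log_hashes(split_sentences(stream, nlp, frag_length))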
Now that I'm looking at the updated code, are you sure it's correct? In the if len(sents) > 0 branch you're sending out the example again, and that condition will always be true if the for loop above runs, because you keep appending to sents. So if there's more than one sentence, wouldn't you yield the example twice?
I actually logged the hashes yesterday...a lot of them do have the same input hash, but that's to be expected because of how that algorithm works. It creates an input hash for a pre-segmented chunk of text, then assigns that same hash to each segment.
I think the code is correct, maybe I'm missing something. It should never yield the same example twice...the first yield only executes after sents gets reset to the empty string. This should mean that the final yield gives new sentences, but only yields if there are new sentences to provide. Is that inconsistent with what you're seeing?
Ah, I think I missed the sents = ""! Nevermind then!
Is your goal to treat those examples as identical and skip them if they re-occur? And if so, are you setting "exclude_by": "input" in your config?
I think the main question and aspect to check would be: Does the stream itself produce duplicate examples and if so, do those examples have the same hashes, or different hashes? And are there any examples that should be treated as identical but are not? (It's usually best to check this at the Python level and not in the app so you don't have any side-effects from how the data is sent back and forth.)
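For example, something like this (a rough sketch, assuming you can call your split_sentences directly):

from collections import Counter

def check_stream(examples):
    # examples: the fully consumed stream, i.e. a list of task dicts
    task_hashes = Counter(eg["_task_hash"] for eg in examples)
    repeated_tasks = [h for h, n in task_hashes.items() if n > 1]

    # Texts that end up with more than one input hash would be examples that
    # should be treated as identical but aren't.
    hashes_by_text = {}
    for eg in examples:
        hashes_by_text.setdefault(eg["text"], set()).add(eg["_input_hash"])
    inconsistent = {text: hs for text, hs in hashes_by_text.items() if len(hs) > 1}

    print(len(examples), "examples")
    print(len(repeated_tasks), "task hashes occurring more than once")
    print(len(inconsistent), "texts with more than one input hash")

# e.g.: check_stream(list(split_sentences(stream, nlp, frag_length=300)))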
The goal there is a little complicated... The sentence segmentation works by reconstructing segments after spaCy has split the text into sentences. When I upgraded spaCy, I found that I was getting duplicates in my datasets because (I'm guessing) some of the tokenization logic changed, which meant that segments differing by a few characters or a sentence were getting different input hashes. Another way I can end up with near-duplicates is if I alter the max segment length: I'd like to avoid having samples that only differ by a sentence.
To prevent that, I pre-emptively create one input hash that every segment from the same original example shares. Now, when I exclude an annotated dataset, I'll never be served examples that have been annotated before, regardless of whether tokenization logic changes or whether I change the fragment length. I am using "exclude_by": "input" to make this all work.
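To make that invariant concrete, this is the property I'm relying on (a toy sketch; nlp is assumed to be a pipeline that sets sentence boundaries):

# Segments produced with different fragment lengths should still share the same
# set of input hashes, because the hash is taken from the full original text.
examples = [{"text": "First sentence here. Second sentence here. Third sentence here."}]

hashes_short = {eg["_input_hash"] for eg in split_sentences(iter(examples), nlp, frag_length=25)}
hashes_long = {eg["_input_hash"] for eg in split_sentences(iter(examples), nlp, frag_length=200)}
assert hashes_short == hashes_long  # one shared hash, regardless of how the text was split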
Does this screw up how feed_overlap works? I'm also wondering if the problem may be with corrections. If a recipe like review creates a new input hash, we could be getting duplicates that way, where the exclude logic no longer works as intended. I can't find any duplicates in the stream, but I'll test what happens with my custom hashing implementation when I correct a dataset with the review recipe.
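For that test, I'm planning something along these lines (a rough sketch; the dataset names are placeholders for ours):

from prodigy.components.db import connect

# Compare input hashes between the originally annotated dataset and the reviewed one.
db = connect()
original = db.get_dataset("ner_annotations")         # placeholder name
reviewed = db.get_dataset("ner_annotations_review")  # placeholder name

orig_hashes = {eg["_input_hash"] for eg in original}
review_hashes = {eg["_input_hash"] for eg in reviewed}
only_in_review = review_hashes - orig_hashes
print(len(only_in_review), "input hashes in the reviewed dataset that don't appear in the original dataset")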
It looks like the use of the review recipe is definitely creating problems. I'll create a fix for that and get our annotators working again, will let you know if the problem persists.
Thanks for the detailed explanation! From what you describe, I think it's definitely possible that it comes down to small differences in the incoming data that cause the hashes to differ for examples that you want to consider the same in your particular use case. So using your own custom hashing that reflects exactly what you consider identical probably makes sense here and it'd also make it easier for you to reason about the hashes that are created and whether there's a problem somewhere.
The review recipe will rehash the tasks before merging and sending them out to make sure they're up to date, because it relies heavily on identifying which annotations belong to the same input so it can group them together. So the final reviewed example will likely end up with a different task hash than the initial suggested and annotated example.