I have an EC2 instance that runs Prodigy sessions where multiple users annotate (each with a unique session ID), and the annotated examples are saved to a remote host. I've set "feed_overlap": false and "exclude_by": "input" in the config so that we don't get duplicates in our datasets.
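For reference, this is roughly where those settings live in our setup (a simplified sketch of the recipe components, not the actual recipe; names are placeholders):

components = {
    "dataset": "ner_corrections",   # placeholder dataset name
    "view_id": "ner_manual",
    # "stream": ...,                # generator of tasks, omitted here
    "config": {
        "feed_overlap": False,      # don't send every example to every named session
        "exclude_by": "input",      # skip examples whose _input_hash has been annotated before
    },
}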
We just corrected a batch of over 1200 examples and found over 300 duplicates. Some of these had been annotated multiple times by different users; in other cases the same user had annotated the same example more than once. I dug through some of our other datasets and found a similar trend.
I am using a custom ner.correct recipe with a custom sentence-splitting implementation, following this thread: https://support.prodi.gy/t/split-sents-threshold-setting-not-working-with-custom-ner-correct/3001/8. Here is the code for it:
import copy

from prodigy import set_hashes


def split_sentences(stream, nlp, frag_length, hash_type="_input_hash"):
    '''Yields tasks for tasks in stream. Input task texts are split into smaller samples,
    but assigned their input hash before being split. This produces a one-to-many mapping
    of hashes to samples.
    '''
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        orig_eg = copy.deepcopy(orig_eg)
        # Get input hash for the entire text of this sample, before we process and split it.
        # This avoids duplicates in our datasets. Processing/splitting techniques can change;
        # if we hash after the fact, we may get duplicates.
        hash = set_hashes(orig_eg, overwrite=True)[hash_type]
        sents = ""
        for i, sent in enumerate(doc.sents):
            eg = copy.deepcopy(orig_eg)
            if i > 0:
                sents += " " + sent.text
            else:
                sents += sent.text
            if len(sents) > frag_length:  # Send out this example, and start rebuilding the next one
                eg["text"] = copy.deepcopy(sents)
                # Set other hashes (for example, task hash)
                eg = set_hashes(eg, overwrite=True)
                # Manually set hash (of the type we care to track) to hash for complete sample.
                eg[hash_type] = hash
                sents = ""
                yield eg
        eg["text"] = copy.deepcopy(sents)
        eg = set_hashes(eg, overwrite=True)
        eg[hash_type] = hash
        yield eg
Most of the time my input hashing technique works, so my hunch is that the hashing logic itself is sound.
Hi! To make it easier to investigate this, which version of Prodigy are you using? (And were you always using the same version during the annotation process or did you upgrade in between?)
Hey! We're on 1.10.3. We've upgraded a few times since starting out with Prodigy, and this issue has been around for most, if not all, of those versions.
The annotators and I also noticed that redundant samples (within the same session) are being served without any pre-highlighted entity spans: the first sample is tagged, the rest aren't.
Okay, then at least it's unlikely to be caused by any recent change. If you look at the hashes of the duplicate examples, do they have different hashes?
I haven't run your sentence segmentation code yet, but one thing I noticed: your logic yields twice, and the second yield is not behind a conditional. yield doesn't end the function the way return does, so if the conditional is met on the last sentence, eg is sent out twice:
for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
    # ...
    for i, sent in enumerate(doc.sents):
        # ...
        if len(sents) > frag_length:
            # ...
            yield eg  # <-- first time
    yield eg          # <-- second time
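To illustrate the general point with a standalone toy example (not your recipe):

def numbers():
    for i in range(3):
        if i == 2:
            yield i   # first yield, inside the conditional
    yield i           # runs unconditionally after the loop

print(list(numbers()))  # [2, 2] – the last value goes out twice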
I think you were right...there was something with that second yield. It didn't make sense to me at first, because that should only yield remaining sentences (ones that did not fall into a prior segment). If there are none, it should yield an example with {"text":""}. But that wasn't happening...it was yielding an example with the full text of the prior segment.
I added a condition, where that second yield only executes if there are left-over sentences, and it works.
def split_sentences(stream, nlp, frag_length, hash_type="_input_hash"):
    '''Yields tasks for tasks in stream. Input task texts are split into smaller samples,
    but assigned their input hash before being split. This produces a one-to-many mapping
    of hashes to samples.
    '''
    tuples = ((eg["text"], eg) for eg in stream)
    for doc, orig_eg in nlp.pipe(tuples, as_tuples=True):
        # Get input hash for the entire text of this sample, before we process and split it.
        # This avoids duplicates in our datasets. Processing/splitting techniques can change;
        # if we hash after the fact, we may get duplicates.
        hash = set_hashes(orig_eg, overwrite=True)[hash_type]
        sents = ""
        for i, sent in enumerate(doc.sents):
            eg = copy.deepcopy(orig_eg)
            if i > 0:
                sents += " " + sent.text
            else:
                sents += sent.text
            if len(sents) > frag_length:  # Send out this example, and start rebuilding the next one
                eg["text"] = sents
                # Set other hashes (for example, task hash)
                eg = set_hashes(eg, overwrite=True)
                # Manually set hash (of the type we care to track) to hash for complete sample.
                eg[hash_type] = hash
                sents = ""
                yield eg
        if len(sents) > 0:
            # There are remaining sentences in the example to send out
            eg["text"] = sents
            eg = set_hashes(eg, overwrite=True)
            eg[hash_type] = hash
            yield eg
Sorry, I spoke too soon. Our annotators worked on the EC2 instance (with the updated sentence segmentation code) in a multi-user session, and I'm seeing lots of duplicates in the dataset. Some were annotated redundantly by multiple users, sometimes by the same user.
Maybe log and inspect the examples that go out and their hashes?
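Something along these lines should work (a rough sketch, assuming the stream is a generator of task dicts that already have their hashes set):

def log_hashes(stream):
    # Wrap the stream and print the hashes of every task that goes out to the app.
    for eg in stream:
        print(eg.get("_input_hash"), eg.get("_task_hash"), repr(eg["text"][:50]))
        yield eg

# in the recipe, e.g.:
# stream = log_hashes(split_sentences(stream, nlp, frag_length))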
Now that I'm looking at the updated code, are you sure it's correct? In the if len(sents) > 0 branch you're sending out the example again, and that condition will always be true if the for loop above runs, because you keep appending to sents. So if there's more than one sentence, wouldn't you yield the example twice?
I actually logged the hashes yesterday...a lot of them do have the same input hash, but that's to be expected because of how that algorithm works. It creates an input hash for a pre-segmented chunk of text, then assigns that same hash to each segment.
I think the code is correct, maybe I'm missing something. It should never yield the same example twice...the first yield only executes after sents gets reset to the empty string. This should mean that the final yield gives new sentences, but only yields if there are new sentences to provide. Is that inconsistent with what you're seeing?
Ah, I think I missed the sents = ""! Nevermind then!
Is your goal to treat those examples as identical and skip them if they re-occur? And if so, are you setting "exclude_by": "input" in your config?
I think the main question and aspect to check would be: Does the stream itself produce duplicate examples and if so, do those examples have the same hashes, or different hashes? And are there any examples that should be treated as identical but are not? (It's usually best to check this at the Python level and not in the app so you don't have any side-effects from how the data is sent back and forth.)
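For example, something like this (a rough sketch, assuming you can call your split_sentences directly):

from collections import Counter

def check_stream(examples):
    # examples: the fully consumed stream, i.e. a list of task dicts
    task_hashes = Counter(eg["_task_hash"] for eg in examples)
    repeated_tasks = [h for h, n in task_hashes.items() if n > 1]

    # Texts that end up with more than one input hash would be examples that
    # should be treated as identical but aren't.
    hashes_by_text = {}
    for eg in examples:
        hashes_by_text.setdefault(eg["text"], set()).add(eg["_input_hash"])
    inconsistent = {text: hs for text, hs in hashes_by_text.items() if len(hs) > 1}

    print(len(examples), "examples")
    print(len(repeated_tasks), "task hashes occurring more than once")
    print(len(inconsistent), "texts with more than one input hash")

# e.g.: check_stream(list(split_sentences(stream, nlp, frag_length=300)))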
The goal there is a little complicated... The sentence segmentation works by reconstructing segments after spaCy has split the text into sentences. When I upgraded spaCy, I found that I was getting duplicates in my datasets because (I'm guessing) some of the tokenization logic changed, which meant that segments differing by a few characters or a sentence were getting different input hashes. Another way I can end up with near-duplicates is if I alter the max segment length: I'd like to avoid having samples that only differ by a sentence.
To prevent that, I pre-emptively create one input hash that every segment from the same original example shares. Now, when I exclude an annotated dataset, I'll never be served examples that have been annotated before, regardless of whether tokenization logic changes or whether I change the fragment length. I am using "exclude_by": "input" to make this all work.
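To make that invariant concrete, this is the property I'm relying on (a toy sketch; nlp is assumed to be a pipeline that sets sentence boundaries):

# Segments produced with different fragment lengths should still share the same
# set of input hashes, because the hash is taken from the full original text.
examples = [{"text": "First sentence here. Second sentence here. Third sentence here."}]

hashes_short = {eg["_input_hash"] for eg in split_sentences(iter(examples), nlp, frag_length=25)}
hashes_long = {eg["_input_hash"] for eg in split_sentences(iter(examples), nlp, frag_length=200)}
assert hashes_short == hashes_long  # one shared hash, regardless of how the text was split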
Does this screw up how feed_overlap works? I'm also wondering if the problem may be with corrections. If a recipe like review creates a new input hash, we could be getting duplicates that way, where the exclude logic no longer works as intended. I can't find any duplicates in the stream, but I'll test what happens with my custom hashing implementation when I correct a dataset with the review recipe.
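For that test, I'm planning something along these lines (a rough sketch; the dataset names are placeholders for ours):

from prodigy.components.db import connect

# Compare input hashes between the originally annotated dataset and the reviewed one.
db = connect()
original = db.get_dataset("ner_annotations")         # placeholder name
reviewed = db.get_dataset("ner_annotations_review")  # placeholder name

orig_hashes = {eg["_input_hash"] for eg in original}
review_hashes = {eg["_input_hash"] for eg in reviewed}
only_in_review = review_hashes - orig_hashes
print(len(only_in_review), "input hashes in the reviewed dataset that don't appear in the original dataset")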
It looks like the use of the review recipe is definitely creating problems. I'll create a fix for that and get our annotators working again, will let you know if the problem persists.
Thanks for the detailed explanation! From what you describe, I think it's definitely possible that it comes down to small differences in the incoming data that cause the hashes to differ for examples that you want to consider the same in your particular use case. So using your own custom hashing that reflects exactly what you consider identical probably makes sense here and it'd also make it easier for you to reason about the hashes that are created and whether there's a problem somewhere.
The review recipe will rehash the tasks before merging and sending them out to make sure they're up to date, because it relies heavily on identifying which annotations belong to the same input so it can group them together. So the final reviewed example will likely end up with a different task hash than the initial suggested and annotated example.