Duplicated examples in db-out for ner.train

This is a classic problem, but I haven't been able to find it in the documentation. I've got a dataset annotated using ner.correct/make-gold with three annotators. I first import the data using prodigy db-in and then train using prodigy train --ner. What happens to the duplicated examples (they have the exact same input_hash and task_hashes)?

Thanks in advance!

hi @vsocrates!

Thanks for your question and welcome to the Prodigy community :wave:

Said differently, are you asking how duplicates are handled in prodigy train? That's a great question because, by default, prodigy train drops duplicates based on the input_hash.

Here's an old post that explains why the number of examples in the dataset may be different from the number that is actually trained on.

From this post, you can run this snippet to check on the unique input hashes:

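from prodigy.components.db import connect
db = connect()  # connects using your Prodigy database settings
# dataset names are passed as positional string arguments, not as a list
input_hashes = db.get_input_hashes("dataset_name")
print(len(input_hashes))  # number of unique inputs in the dataset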
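If you've already exported the data with db-out, you can get a similar picture straight from the JSONL by counting how often each _input_hash appears. This is just a minimal sketch: the records and the helper name count_by_input_hash are made up for illustration, and it only assumes that each record carries the usual _input_hash key.

```python
from collections import Counter


def count_by_input_hash(examples):
    """Map each _input_hash to the number of records that carry it."""
    return Counter(eg["_input_hash"] for eg in examples)


# Made-up records standing in for lines of a db-out JSONL export.
examples = [
    {"text": "Apple hires in Paris", "_input_hash": 111, "_task_hash": 901},
    {"text": "Apple hires in Paris", "_input_hash": 111, "_task_hash": 901},
    {"text": "Berlin is cold", "_input_hash": 222, "_task_hash": 902},
]

counts = count_by_input_hash(examples)
# Inputs that were annotated more than once.
duplicated = {h: n for h, n in counts.items() if n > 1}

print(f"{len(examples)} records, {len(counts)} unique inputs")
print(f"inputs annotated more than once: {duplicated}")
```

If the number of unique inputs is lower than the number of records, that gap is roughly what you'd expect to "disappear" when training.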

Also, consider setting the environment variable PRODIGY_LOGGING=basic when you run prodigy train to see if anything else is being skipped.

Just curious, did each annotator annotate all of the examples or were all records annotated only once?

Relatedly, were you aware that you can use multi-user sessions to annotate the data? If so, did you modify feed_overlap?

Ideally, you could have used the multi-user sessions to identify who labeled what and then used feed_overlap to determine whether you wanted each annotator to annotate all of the examples ("feed_overlap": true, which is the default behavior) or to send out each example in the data once to whoever is available ("feed_overlap": false).

This is a good point! I'll make a note and consider adding more documentation or creating some future content that mentions this point.

One last small point -- while the train and data-to-spacy functions dedup based on input_hash, when reading in data for annotation, recipes by default exclude duplicates by task_hash (see config options). This can be modified by changing "exclude_by" to "input" in the config or as an override.
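To make the difference between the two settings a bit more concrete, here's a toy sketch. It is not Prodigy's actual filtering code; the function name filter_stream and the records are hypothetical, and it only assumes that tasks carry the usual _input_hash and _task_hash keys.

```python
def filter_stream(stream, seen_examples, exclude_by="task"):
    """Drop incoming tasks whose hash already appears in seen_examples."""
    key = "_task_hash" if exclude_by == "task" else "_input_hash"
    seen = {eg[key] for eg in seen_examples}
    return [eg for eg in stream if eg[key] not in seen]


# Made-up record that is already in the dataset.
already_annotated = [
    {"text": "Apple hires in Paris", "_input_hash": 1, "_task_hash": 10},
]

# Made-up incoming tasks: same task, same text with a different
# question (new task hash), and a completely new text.
stream = [
    {"text": "Apple hires in Paris", "_input_hash": 1, "_task_hash": 10},
    {"text": "Apple hires in Paris", "_input_hash": 1, "_task_hash": 11},
    {"text": "Berlin is cold", "_input_hash": 2, "_task_hash": 20},
]

by_task = filter_stream(stream, already_annotated, exclude_by="task")
by_input = filter_stream(stream, already_annotated, exclude_by="input")

print([eg["_task_hash"] for eg in by_task])
print([eg["_task_hash"] for eg in by_input])
```

With "task", the same text can come back if the question about it is different. With "input", any text that's already in the dataset is skipped.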

Let us know if you have any further questions!

Thanks so much for the fast and detailed answer!!

Yes, that is what I was asking. I see, so in essence, prodigy train arbitrarily chooses one of the annotations to go with? The link you provided seems to suggest that the span annotations would be merged instead though, and not dropped. Please let me know which one is the case and how we can control the behavior?

I believe it was a mix, as not all examples were annotated by all annotators (mostly due to time/cost limitations), but there are definitely duplicate annotations.

We did use multi-user sessions, and have not modified feed_overlap, so it looks like that lines up!

hi @vsocrates!

So, it's both because the order is important. Thanks for clarifying as I can see how my initial response could be a bit confusing.

For each input/doc, it'll first merge the entities from all annotations that share the same input_hash into one example. Since all of the entities for that input/doc have already been merged at that point, the default check for dups by input_hash that runs afterwards keeps the merged entities instead of dropping any of them.

If it did the dedups by input_hash before merging the entities, then it wouldn't work the same. So the order is important.
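Here's a toy version of why the order matters. To be clear, this is not Prodigy's internal code, just a simplified illustration: the helper names merge_spans_by_input and dedup_by_input and the two records are made up, and conflicting spans are not handled at all.

```python
def merge_spans_by_input(examples):
    """Combine the spans of all records sharing an _input_hash into one record."""
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            merged[key] = {"text": eg["text"], "_input_hash": key, "spans": []}
        for span in eg.get("spans", []):
            # Identical spans from different annotators are only kept once.
            if span not in merged[key]["spans"]:
                merged[key]["spans"].append(span)
    return list(merged.values())


def dedup_by_input(examples):
    """Keep only the first record seen for each _input_hash."""
    seen = set()
    kept = []
    for eg in examples:
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])
            kept.append(eg)
    return kept


# Two made-up annotations of the same text by different annotators.
text = "Apple opens office in Paris"
examples = [
    {"text": text, "_input_hash": 1,
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": text, "_input_hash": 1,
     "spans": [{"start": 22, "end": 27, "label": "GPE"}]},
]

merge_then_dedup = dedup_by_input(merge_spans_by_input(examples))
dedup_then_merge = merge_spans_by_input(dedup_by_input(examples))

print(len(merge_then_dedup[0]["spans"]))  # both annotators' spans survive
print(len(dedup_then_merge[0]["spans"]))  # only the first annotator's span
```

Merging first keeps both the ORG and the GPE span on the single remaining example. Deduplicating first throws away the second annotator's record before its spans can be merged.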

There's one more case: a given input_hash (input/doc) can have different annotations from different user interfaces (e.g., some are binary annotations from ner.teach and some are manual annotations from ner.manual). In that case, the logic prefers manual (non-binary) labels, so the binary annotations will be dropped and the manual annotations will be used for that input_hash. I don't think you're asking about this, but it's good to be aware of.
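As a rough sketch of that "prefer manual" rule: the example below groups records by _input_hash and, when a group contains any manual annotations, keeps only those. This is an illustration and not Prodigy's implementation. Telling the two kinds apart by _view_id ("ner_manual" vs. "ner") is my assumption here, and the function name prefer_manual and the records are made up.

```python
from collections import defaultdict

# Assumption for this sketch: records from the manual interface carry this view id.
MANUAL_VIEWS = {"ner_manual"}


def prefer_manual(examples):
    """Per _input_hash, keep manual records if any exist, else keep everything."""
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg["_input_hash"]].append(eg)
    kept = []
    for group in grouped.values():
        manual = [eg for eg in group if eg.get("_view_id") in MANUAL_VIEWS]
        kept.extend(manual if manual else group)
    return kept


# Made-up records: input 1 has both kinds, input 2 only has a binary annotation.
examples = [
    {"text": "Apple hires in Paris", "_input_hash": 1, "_view_id": "ner"},
    {"text": "Apple hires in Paris", "_input_hash": 1, "_view_id": "ner_manual"},
    {"text": "Berlin is cold", "_input_hash": 2, "_view_id": "ner"},
]

kept = prefer_manual(examples)
print([(eg["_input_hash"], eg["_view_id"]) for eg in kept])
```

The binary record for input 1 is dropped because a manual one exists, while input 2 keeps its binary annotation since there's nothing better available.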

Let me know if this clears up the confusion.

hi @ryanwesslen, apologies for the late reply, I got distracted by a few other tasks! This makes a lot of sense, thank you!

I think I'm all set for now, but I'll give this a go and report back if I run into anything else I don't understand!