Duplicated examples in db-out for ner.train

hi @vsocrates!

Thanks for your question and welcome to the Prodigy community :wave:

Said differently, are you asking how duplicates are handled in `prodigy train`? That's a great question because, by default, `prodigy train` will drop duplicates based on the `input_hash`.

Here's an older post that explains why the number of examples in the dataset may differ from the number actually trained on.

From this post, you can run this snippet to check on the unique input hashes:

```python
from prodigy.components.db import connect

db = connect()
input_hashes = db.get_input_hashes("dataset_name")
print(len(set(input_hashes)))
```
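To build intuition for why the trained count can be smaller than the dataset count, here's a minimal sketch (with made-up texts and hash values, not Prodigy's internals) of what dedup by input hash does: two annotations of the same input share an input hash, so only one survives.

```python
# Illustrative sketch: dedup by input hash keeps one record per unique input.
examples = [
    {"text": "Acme Corp hired Jo", "input_hash": 101, "answer": "accept"},
    {"text": "Acme Corp hired Jo", "input_hash": 101, "answer": "accept"},  # same input, annotated twice
    {"text": "Jo joined Acme", "input_hash": 202, "answer": "accept"},
]

seen = set()
deduped = []
for eg in examples:
    if eg["input_hash"] not in seen:
        seen.add(eg["input_hash"])
        deduped.append(eg)

print(len(examples), "annotations ->", len(deduped), "unique inputs")
```

So three saved annotations can become two training examples, which is the kind of gap the snippet above helps you spot.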

Also, consider setting `PRODIGY_LOGGING=basic` to see if anything else is being skipped.
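For example, you could prefix your train command with the environment variable (the output directory and dataset name below are placeholders):

```
PRODIGY_LOGGING=basic prodigy train ./output --ner dataset_name
```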

Just curious, did each annotator annotate all of the examples, or was each record annotated only once?

Related, were you aware of multi-user sessions to annotate the data? If so, did you modify feed_overlap?

Ideally, you could have used multi-user sessions to identify who labeled what, and then used `feed_overlap` to control whether each annotator annotates all of the examples (`"feed_overlap": true`) or each example is sent out once to whoever is available (`"feed_overlap": false`, the default behavior).
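In your `prodigy.json`, that's a one-line setting. For instance, to have every annotator see every example:

```json
{
  "feed_overlap": true
}
```

Each annotator then opens the app with their own `?session=name` query parameter so their answers are attributed to a named session.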

This is a good point! I'll make a note and consider adding more documentation or creating some future content that mentions this point.

One last small point -- while the `train` and `data-to-spacy` commands dedupe based on `input_hash`, when reading in data for annotation, recipes by default exclude duplicates by `task_hash` (see the config options). This can be modified by setting `"exclude_by"` to `"input"` in your config or as an override.
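That annotation-time setting also lives in `prodigy.json` (or can be passed as an override), e.g.:

```json
{
  "exclude_by": "input"
}
```

With `"exclude_by": "input"`, any example whose input has already been annotated in the dataset is skipped, regardless of which task variant it came from.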

Let us know if you have any further questions!