hi @vsocrates!
Thanks for your question and welcome to the Prodigy community!
Said differently, are you asking how duplicates are handled in `prodigy train`? That's a great question because by default, `prodigy train` will drop duplicates based on the `input_hash`.
Here's an old post that explains why the number of examples in the dataset may differ from what is trained.
From this post, you can run this snippet to check on the unique input hashes:
```python
from prodigy.components.db import connect

db = connect()  # connects using your prodigy.json settings
# Replace "dataset_name" with the name of your dataset
input_hashes = db.get_input_hashes("dataset_name")
print(len(set(input_hashes)))
```
Also, consider adding `PRODIGY_LOGGING=basic` to see if anything else is being skipped.
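For example, you could prefix your recipe command like this (the `train` arguments below are just placeholders for your own dataset and output path):

```bash
# Run training with basic logging enabled to surface skipped/duplicate examples
PRODIGY_LOGGING=basic prodigy train ./output --ner dataset_name
```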
Just curious, did each annotator annotate all of the examples, or was each record annotated only once? Related, were you aware of multi-user sessions for annotating the data? If so, did you modify `feed_overlap`?
Ideally, you could have used multi-user sessions to identify who labeled what, and then used `feed_overlap` to determine whether you wanted each annotator to annotate all of the examples (`"feed_overlap": true`) or send out each example in the data only once to whoever is available (`"feed_overlap": false`, which is the default behavior).
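As a minimal sketch (using example session names and the default port), `feed_overlap` can be set in your `prodigy.json`:

```json
{
  "feed_overlap": true
}
```

Each annotator then opens the app with their own session name appended to the URL, e.g. `http://localhost:8080/?session=alex`, and that session name is stored with their annotations so you can see who labeled what.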
This is a good point! I'll make a note and consider adding more documentation or future content that covers this.
One last small point -- while the `train` and `data-to-spacy` commands dedupe based on `input_hash`, recipes reading in data for annotation exclude duplicates by `task_hash` by default (see the config options). This can be modified by changing `"exclude_by"` to `"input"` in the config or as an override.
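For instance, if you want recipes to skip anything with a previously seen input rather than just identical tasks, you could set this in your `prodigy.json`:

```json
{
  "exclude_by": "input"
}
```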
Let us know if you have any further questions!