Quick question about the annotations I made using ner.correct and ner.teach
ner.correct produces full annotations with potentially multiple entities per doc, while ner.teach produces binary accept/reject annotations.
Both sets of annotations ended up in the same dataset, so the 1000 annotations might only cover about 400 unique examples or docs.
When I run prodigy train on this dataset it shows:
"Created and merged data for 400 total examples"
How can this merge be applied to the dataset itself so that the annotation count reflects the 400 unique examples? I tried ner.silver-to-gold, but it doesn't seem to work since both types of output are in the same dataset.
Hi, I hope I understand your question correctly! When you run prodigy train, the examples in the dataset will be merged to reflect the unique examples, and all annotations that are available for a given example will be combined to create the final training example.
However, mixing annotations of different types (binary and manual) in the same dataset can lead to unexpected results and means you won't be able to update the model as effectively. To train from binary yes/no answers, the model needs to be updated differently: it should take the rejected answers into account and treat all unannotated tokens as unknown. This is what happens when you set --binary on prodigy train. If you train from complete gold-standard annotations created with ner.correct, you typically want all unannotated tokens to be treated as non-entity tokens, which makes it easier for the model to learn. So we generally recommend keeping those types of annotation in separate datasets.
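For illustration, once the two annotation types live in separate datasets, each can be trained with the appropriate settings. The dataset names and output paths here are placeholders:

```shell
# Gold-standard annotations from ner.correct (unannotated tokens = non-entities):
prodigy train ner ner_gold_data en_core_web_sm --output ./model_gold

# Binary accept/reject annotations from ner.teach (unannotated tokens = unknown):
prodigy train ner ner_binary_data en_core_web_sm --binary --output ./model_binary
```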
So one option would be to export the annotations, use their metadata to separate them into two sets, and then re-import the data into two new datasets. Also see this thread for more details:
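As a rough sketch of that separation step: examples saved by Prodigy carry a "_view_id" field reflecting the annotation interface, so you can split an exported JSONL file on it. The exact "_view_id" values ("ner_manual" for ner.correct, "ner" for ner.teach) are assumptions here; check your own export to confirm what your version wrote.

```python
import json

def split_by_view_id(examples):
    """Split exported Prodigy examples into manual vs. binary sets.

    Assumes each example has a "_view_id" field: "ner_manual" for
    annotations from ner.correct, anything else (e.g. "ner" from
    ner.teach) is treated as binary.
    """
    manual, binary = [], []
    for eg in examples:
        if eg.get("_view_id") == "ner_manual":
            manual.append(eg)
        else:
            binary.append(eg)
    return manual, binary

# Hypothetical usage with a JSONL export from `prodigy db-out`:
# with open("mixed_dataset.jsonl", encoding="utf8") as f:
#     examples = [json.loads(line) for line in f]
# manual, binary = split_by_view_id(examples)
# You could then write each list back out as JSONL and re-import the
# two files into separate datasets with `prodigy db-in`.
```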