How to merge data from ner.correct and ner.teach?

Jeff77789 · November 6, 2020, 9:49pm

Quick question about the annotations I made using ner.correct and ner.teach

ner.correct has multiple annotations per doc and ner.teach creates binary accept/reject annotations

Both sets of annotations are in the same dataset, so 1000 annotations might cover only about 400 unique examples (docs).

When I run prodigy train on this dataset it shows:
"Created and merged data for 400 total examples"

How can I apply this merge to the dataset itself, so that the annotation count reflects the 400 unique examples? I tried ner.silver-to-gold, but it doesn't seem to work since both outputs are in the same dataset.

ines · November 9, 2020, 1:34am

Hi, I hope I understand your question correctly! When you run prodigy train, the examples in the dataset will be merged to reflect the unique examples, and all annotations that are available for a given example will be combined to create the final training example.
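To see where a number like "400 total examples" comes from, here is a rough sketch that groups exported annotations by input. It assumes the dataset was exported to JSONL and that each record carries an `_input_hash` field identifying the underlying text; the sample records and the function name are made up for illustration, and this is not the merging logic that prodigy train itself runs.

```python
import json
from collections import defaultdict


def group_by_input_hash(jsonl_lines):
    """Group annotation records by their `_input_hash` value."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        groups[record["_input_hash"]].append(record)
    return dict(groups)


# Made-up records standing in for an exported dataset (assumption).
sample_lines = [
    json.dumps({"text": "Apple opened a store.", "_input_hash": 1, "answer": "accept"}),
    json.dumps({"text": "Apple opened a store.", "_input_hash": 1, "answer": "reject"}),
    json.dumps({"text": "Apple opened a store.", "_input_hash": 1, "answer": "accept"}),
    json.dumps({"text": "Berlin is in Germany.", "_input_hash": 2, "answer": "accept"}),
    json.dumps({"text": "Berlin is in Germany.", "_input_hash": 2, "answer": "accept"}),
]

groups = group_by_input_hash(sample_lines)
print(f"{len(sample_lines)} annotations, {len(groups)} unique examples")
```

Several annotations on the same text collapse into one group, which is why the annotation count in the dataset can be much higher than the number of training examples reported.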

However, mixing annotations of different types (binary and manual) in the same dataset can sometimes lead to unexpected results and means you won't be able to update the model as effectively. To train from binary yes/no questions, you want to update differently and consider the rejected answers, while also treating all unannotated tokens as unknown. This is what happens when you set --binary on prodigy train. If you train from complete gold-standard annotations created with ner.correct, you typically want to treat all unannotated tokens as non-entity tokens, which makes it easier for the model to learn. So we typically recommend keeping those types of annotation separate.

So one option would be to just use the metadata of the exported annotations to separate them into two sets and then re-import the data. Also see this thread for more details:
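The separation by metadata could be sketched roughly as follows. This assumes each exported record has a `_view_id` field, with "ner_manual" for annotations created with ner.correct and "ner" for binary annotations from ner.teach; check a few records from your own export first, since older data may not include that field. The function name and sample records are hypothetical.

```python
import json


def split_by_view_id(records, manual_ids=("ner_manual",), binary_ids=("ner",)):
    """Split records into manual, binary and unrecognized lists by `_view_id`."""
    manual, binary, unknown = [], [], []
    for record in records:
        view_id = record.get("_view_id")
        if view_id in manual_ids:
            manual.append(record)
        elif view_id in binary_ids:
            binary.append(record)
        else:
            unknown.append(record)  # inspect these by hand
    return manual, binary, unknown


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Made-up records standing in for the output of `prodigy db-out` (assumption).
records = [
    {"text": "Apple opened a store.", "_view_id": "ner_manual", "answer": "accept"},
    {"text": "Apple opened a store.", "_view_id": "ner", "answer": "reject"},
    {"text": "Berlin is in Germany.", "_view_id": "ner", "answer": "accept"},
    {"text": "No metadata here.", "answer": "accept"},
]

manual, binary, unknown = split_by_view_id(records)
print(len(manual), "manual,", len(binary), "binary,", len(unknown), "unrecognized")
manual_jsonl = to_jsonl(manual)  # write to a file, then re-import as its own dataset
binary_jsonl = to_jsonl(binary)
```

Each JSONL string can then be written to its own file and loaded into a separate new dataset with prodigy db-in, so the manual and binary annotations can be trained on separately.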
