My NER workflow has been to use ner.teach to create an initial model, then create a gold dataset for each document, export each one with db-out, concatenate all the gold datasets and batch train a final model.
I initially did this for a set of documents and a few labels and now I’m adding another label and creating a new dataset for each new label/document pair, reusing the same document set.
How does Prodigy interpret the same sentence appearing twice in the dataset with different labels? Does the lack of an annotation indicate that a token is definitely not part of an entity, or that it is unknown? Do the annotations occurring in the same text need to be merged prior to training?
I’m wondering: if I have gold datasets for labels X and Y for documents 1-10, and gold datasets for label Z only for documents 1-3, am I hurting performance by asserting that there are no occurrences of Z in documents 4-10?
When you run the built-in ner.batch-train, Prodigy will automatically merge all examples on the same input, i.e. the same text (determined by comparing the input hashes of the examples). The "spans" will then be merged together as well.
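To make the merging step concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: it only assumes that each exported example carries a "text", an "_input_hash" and a list of "spans", and the helper name `merge_by_input_hash` is made up for illustration.

```python
def merge_by_input_hash(examples):
    """Group examples that share an "_input_hash" and combine their spans.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            merged[key] = {"text": eg["text"], "_input_hash": key, "spans": []}
        # Track spans already collected so exact duplicates are added once.
        seen = {(s["start"], s["end"], s["label"]) for s in merged[key]["spans"]}
        for span in eg.get("spans", []):
            ident = (span["start"], span["end"], span["label"])
            if ident not in seen:
                merged[key]["spans"].append(dict(span))
                seen.add(ident)
    for eg in merged.values():
        eg["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


# Same sentence annotated in two datasets, once per label.
examples = [
    {"text": "Apple hired Tim Cook", "_input_hash": 1,
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim Cook", "_input_hash": 1,
     "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
]
merged = merge_by_input_hash(examples)
```

Here the two examples collapse into one entry whose "spans" list holds both the ORG and the PERSON annotation, so you don't need to merge anything by hand before training.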
By default, the training process will assume that all missing values are unknown – so if there’s no entity annotation for a token, it’s treated as a missing value rather than an O token (definitely outside an entity). This allows training from binary annotations like the ones you collect in ner.teach. (To disable this behaviour and train from gold-standard annotations where you know that unannotated tokens are definitely not entities, you can set the --no-missing flag btw.)
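The difference between "missing" and "definitely outside" can be sketched at the token level. The snippet below is only an illustration under simplifying assumptions: it splits on whitespace instead of using a real tokenizer, the function name `token_labels` is hypothetical, and the "-" placeholder for an unknown label is just this sketch's notation.

```python
def token_labels(text, spans, no_missing=False):
    """Label whitespace tokens; unannotated tokens get "O" or "-".

    Hypothetical helper: "-" stands for "unknown", "O" for "outside an entity".
    """
    default = "O" if no_missing else "-"
    labels = []
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        label = default
        for span in spans:
            # A token is labelled if it lies fully inside an annotated span.
            if start >= span["start"] and end <= span["end"]:
                label = span["label"]
        labels.append((token, label))
    return labels


text = "Apple hired Tim Cook"
spans = [{"start": 0, "end": 5, "label": "ORG"}]

partial = token_labels(text, spans)                # unannotated tokens are unknown
gold = token_labels(text, spans, no_missing=True)  # unannotated tokens are "O"
```

With the default, "Tim Cook" stays open for the model to label as anything. With `no_missing=True`, the same annotation asserts that "Tim Cook" is not an entity at all, which is only what you want if the annotation really is complete.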
To update the model with incomplete annotations, Prodigy essentially generates the best possible analysis of the example given the constraints defined by the annotations. If your data includes conflicting spans, those will have to be ignored. But if the annotations contain different pieces of information about the example, we can put them together and update the weights proportionally, even if we don’t know the full truth.
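If you want to check your own data for conflicts before training, something like the following could work. It is a rough sketch, not how Prodigy resolves conflicts internally: it assumes that two spans conflict when they overlap without being identical, and the helper name `drop_conflicting_spans` is made up.

```python
def drop_conflicting_spans(spans):
    """Keep only spans that do not overlap a different span.

    Assumption for this sketch: overlapping spans with different
    boundaries or labels are conflicts, and both sides are dropped.
    """
    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    def same(a, b):
        return (a["start"], a["end"], a["label"]) == (b["start"], b["end"], b["label"])

    kept = []
    for i, span in enumerate(spans):
        conflict = any(
            j != i and overlaps(span, other) and not same(span, other)
            for j, other in enumerate(spans)
        )
        if not conflict:
            kept.append(span)
    return kept


spans = [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 0, "end": 5, "label": "PRODUCT"},  # disagrees with the ORG span
    {"start": 12, "end": 20, "label": "PERSON"},
]
compatible = drop_conflicting_spans(spans)
```

In this example only the PERSON span survives, because the ORG and PRODUCT annotations disagree about the same characters.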
If you’re performing all those updates while treating unlabelled tokens as missing values, then you might actually improve accuracy because you’d be preventing the model from predicting Z where you definitely know it doesn’t occur. However, if you have gold-standard annotations, you might as well take advantage of that and update the model in a way that treats unlabelled tokens as O. Just keep in mind that with --no-missing, every unannotated token counts as "not an entity", so documents 4-10 would be treated as containing no Z until you’ve annotated Z there as well.
You might want to check out this example of a silver-to-gold workflow btw. It lets you create gold-standard data from silver-standard data (e.g. binary annotations) by generating the best analysis and then correcting it manually if needed.