NER overlapping datasets, meaning of lack of annotation

When you run the built-in ner.batch-train, Prodigy will automatically merge all examples on the same input, i.e. the same text (determined by comparing the input hashes of the examples). The "spans" will then be merged together as well.
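Conceptually, the merging step looks something like this. This is a simplified sketch of the idea, not Prodigy's internal implementation; it assumes each example dict carries Prodigy's `"_input_hash"` field and a `"spans"` list:

```python
from collections import defaultdict

def merge_examples(examples):
    """Group examples by input hash (i.e. the same text) and combine
    their entity spans into a single example (sketch only)."""
    by_input = defaultdict(list)
    for eg in examples:
        by_input[eg["_input_hash"]].append(eg)
    merged = []
    for egs in by_input.values():
        combined = dict(egs[0])  # keep the text, tokens, meta of the first copy
        seen, spans = set(), []
        for eg in egs:
            for span in eg.get("spans", []):
                key = (span["start"], span["end"], span["label"])
                if key not in seen:
                    seen.add(key)
                    spans.append(span)
        combined["spans"] = spans
        merged.append(combined)
    return merged
```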

By default, the training process treats unannotated tokens as unknown: if there's no entity annotation for a token, it's considered a missing value rather than an O token (definitely outside an entity). This is what allows training from binary annotations like the ones you collect with ner.teach. (To disable this behaviour and train from gold-standard annotations, where you know that unannotated tokens are definitely not entities, you can set the --no-missing flag btw.)
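For example, here's what the two situations look like in Prodigy's JSON format. The text, labels and command are just made up for illustration:

```python
# Binary annotation as collected with ner.teach: one accepted span.
binary_example = {
    "text": "Uber expanded to London last year.",
    "spans": [{"start": 0, "end": 4, "label": "ORG"}],  # "Uber"
    "answer": "accept",
}
# Default training: "London" and "last year" are simply unknown. The model
# isn't told they're outside an entity, so it isn't penalised for predicting
# GPE or DATE there.

# Gold-standard annotation: every entity in the text is labelled, so it's
# safe to treat all remaining tokens as O, e.g. (illustrative command):
#   prodigy ner.batch-train my_dataset en_core_web_sm --no-missing
gold_example = {
    "text": "Uber expanded to London last year.",
    "spans": [
        {"start": 0, "end": 4, "label": "ORG"},     # "Uber"
        {"start": 17, "end": 23, "label": "GPE"},   # "London"
        {"start": 24, "end": 33, "label": "DATE"},  # "last year"
    ],
}
```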

To update the model with incomplete annotations, Prodigy essentially generates the best possible analysis of the example given the constraints defined by the annotations. If your data includes conflicting spans, those have to be ignored; but if the annotations contain different pieces of information about the example, they can be put together and used to update the weights proportionally, even if we don't know the full truth.
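As a toy illustration of the "conflicting spans have to be ignored" part: two spans that overlap can't both be part of the same analysis, so one of them has to be dropped before the constraints are applied. This is just a sketch of that idea, not Prodigy's actual logic, which happens inside the model's beam search:

```python
def drop_conflicting_spans(spans):
    """Keep a mutually consistent subset of spans by dropping any span that
    overlaps one we've already kept (preferring longer spans, an arbitrary choice)."""
    kept = []
    for span in sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True):
        overlaps = any(span["start"] < k["end"] and k["start"] < span["end"] for k in kept)
        if not overlaps:
            kept.append(span)
    return kept


# Example: a 0-13 GPE span conflicts with a 0-8 GPE span over the same text.
spans = [
    {"start": 0, "end": 8, "label": "GPE"},
    {"start": 0, "end": 13, "label": "GPE"},
    {"start": 20, "end": 26, "label": "ORG"},
]
print(drop_conflicting_spans(spans))
# keeps the 0-13 GPE span and the 20-26 ORG span
```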

My slides here show an example of this process.

If you're performing all those updates while treating unlabelled tokens as missing values, you can still improve accuracy, because the annotations you do have let you prevent the model from predicting a label where you definitely know it doesn't occur. However, if you have gold-standard annotations, you might as well take advantage of that and update the model in a way that treats unlabelled tokens as O.
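In terms of spaCy's BILUO tags, the difference comes down to whether an unannotated token ends up as "O" (definitely not part of an entity) or "-" (unknown). A small illustration, assuming spaCy 2.x, where biluo_tags_from_offsets lives in spacy.gold:

```python
import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy 2.x location

nlp = spacy.blank("en")
doc = nlp("Uber expanded to London last year.")

# Gold-standard annotation: every entity is labelled, so it's safe to mark
# the remaining tokens as "O" (this is what --no-missing assumes).
gold = [(0, 4, "ORG"), (17, 23, "GPE"), (24, 33, "DATE")]
print(biluo_tags_from_offsets(doc, gold))
# ['U-ORG', 'O', 'O', 'U-GPE', 'B-DATE', 'L-DATE', 'O']

# Binary annotation from ner.teach: only "Uber" was annotated, so every other
# token stays unknown ("-") and the model isn't penalised for its predictions there.
binary = ["U-ORG", "-", "-", "-", "-", "-", "-"]
```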

You might want to check out this example of a silver-to-gold workflow btw. It lets you create gold-standard data from silver-standard data (e.g. binary annotations) by generating the best analysis and then correcting it manually if needed.
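If you'd rather roll your own, a rough sketch of that kind of recipe could look like the one below. It's a heavily simplified assumption: it pre-highlights the model's plain predictions for correction in the manual interface, whereas the actual silver-to-gold recipe constrains the analysis with the existing silver spans. The recipe name and arguments are made up for illustration:

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe("ner.silver-to-gold-sketch")
def silver_to_gold_sketch(dataset, source, spacy_model, labels):
    """Pre-highlight the model's analysis of each example so the annotator
    only has to correct it instead of labelling from scratch (sketch only)."""
    nlp = spacy.load(spacy_model)
    labels = labels.split(",")

    def make_tasks(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            # Use the model's predictions as the starting point. The real
            # silver-to-gold workflow would constrain this analysis with the
            # existing (silver) spans instead of predicting from scratch.
            eg["spans"] = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
                if ent.label_ in labels
            ]
            yield eg

    stream = add_tokens(nlp, make_tasks(JSONL(source)))
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": labels},
    }
```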
