Hi! These are all good questions, and it's worth considering them explicitly for each project.
The annotations don't necessarily have to be on the same sentences, although it's usually good to have at least some overlap. Otherwise, you can more easily end up with imbalanced data, and you'll also never have examples of texts with multiple different entities, which could mean that there's less useful information for the model to learn from, and more unknowns.
Binary annotations can definitely be very useful for moving your model in a better direction and correcting its mistakes. As you say, they're really efficient to create, and they often include the specific examples that the model can get the most value out of.
That said, if you're training a new model from scratch, it's usually good to focus on a reasonably sized corpus of complete annotations as an end goal, and you can use your binary annotations to create an intermediate model to help with that. Instead of converting your binary annotations, another thing you could do is train a model using the data you already have, and then use it with ner.correct. If your model is pretty good already, this can also be extremely fast, because you only have to correct what the model gets wrong. So you can easily build up a very large corpus of complete, gold-standard annotations without having to do much manual labelling at all.
In Prodigy v1.10, the training mechanisms for manual and binary annotations require different logic, so you'll have to run the training separately. Ideally, you'd start with the manual annotations first, because those give the model more complete information to learn from, especially if you're starting from scratch.
In Prodigy v1.11 (currently available as a nightly pre-release), you'll be able to train from both manual and binary annotations jointly, and it's also something we'd recommend. You'll be able to get better results if your annotations include at least some complete examples, on top of the binary decisions.
When you're training the model, you should ideally train on all binary datasets together. Prodigy will take care of merging all annotations on the same input, so if a sentence contains binary annotations for two labels, the model will be updated with both pieces of information together.
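To make the idea of merging annotations on the same input more concrete, here's a rough sketch in plain Python. It is not Prodigy's actual implementation, just an illustration of the principle. It assumes each record carries the `_input_hash`, `text`, `spans` and `answer` keys that Prodigy tasks typically have; the function name `merge_binary_by_input` and the "accepted"/"rejected" output layout are made up for this example.

```python
from collections import defaultdict


def merge_binary_by_input(examples):
    """Group binary accept/reject decisions that refer to the same input text.

    Hypothetical helper for illustration only, not part of Prodigy.
    """
    merged = defaultdict(lambda: {"text": None, "accepted": [], "rejected": []})
    for eg in examples:
        # Only "accept" and "reject" carry information; skip e.g. "ignore".
        if eg.get("answer") not in ("accept", "reject"):
            continue
        entry = merged[eg["_input_hash"]]
        entry["text"] = eg["text"]
        key = "accepted" if eg["answer"] == "accept" else "rejected"
        for span in eg.get("spans", []):
            entry[key].append((span["start"], span["end"], span["label"]))
    return dict(merged)


# Toy records: three decisions about the same sentence, one ignored example.
examples = [
    {"_input_hash": 1, "text": "Apple hired Tim Cook", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 1, "text": "Apple hired Tim Cook", "answer": "accept",
     "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
    {"_input_hash": 1, "text": "Apple hired Tim Cook", "answer": "reject",
     "spans": [{"start": 0, "end": 5, "label": "PERSON"}]},
    {"_input_hash": 2, "text": "Berlin is nice", "answer": "ignore",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

merged = merge_binary_by_input(examples)
print(merged[1]["accepted"])
print(merged[1]["rejected"])
```

After grouping, the sentence has two accepted spans (one per label) and one rejected span, which is the kind of combined information a single update can use.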
The --ner-missing flag is really only intended for non-binary annotations where you want to consider all unannotated tokens as "missing values" (as opposed to "not an entity", which is typically the default). This is already included when you train from --binary, because binary annotations always mean that you only know the answer for one particular token sequence, and nothing about all other unannotated tokens.
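The difference between "missing" and "not an entity" can be shown with a small toy example. This is only a sketch: the IOB-style tags and the use of `None` as the marker for a missing value are assumptions made for illustration, and they are not meant to reflect how Prodigy or spaCy represent this internally. The helper `token_labels` is hypothetical.

```python
def token_labels(tokens, span, label, unannotated_is_missing):
    """Return one label per token, given a single annotated token span.

    span is (start_token, end_token) with the end index exclusive.
    Unannotated tokens become None ("missing") or "O" ("not an entity").
    """
    fill = None if unannotated_is_missing else "O"
    labels = [fill] * len(tokens)
    start, end = span
    for i in range(start, end):
        prefix = "B" if i == start else "I"
        labels[i] = f"{prefix}-{label}"
    return labels


tokens = ["Apple", "hired", "Tim", "Cook"]

# Default interpretation: everything unannotated is "not an entity".
as_outside = token_labels(tokens, (0, 1), "ORG", unannotated_is_missing=False)

# Missing-values interpretation: we know nothing about the other tokens.
as_missing = token_labels(tokens, (0, 1), "ORG", unannotated_is_missing=True)

print(as_outside)  # ['B-ORG', 'O', 'O', 'O']
print(as_missing)  # ['B-ORG', None, None, None]
```

With the first interpretation, the model would be told that "Tim Cook" is definitely not an entity. With the second, it is free to predict anything for those tokens, which is what you want when only one span was annotated.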