ner.train number of examples

When looking at the examples, it's important to distinguish between the input hash and the task hash. The input hash is based only on the input data, e.g. the text, so you can easily have lots of annotations on the same input, but with different spans. Prodigy won't replace examples in the dataset – your dataset should always be an exact record of each individual annotation decision. But when you run ner.batch-train, Prodigy will merge all spans on the same input and use the annotation decisions to generate a set of "constraints" for each text – for example, existing entities, spans that we know are not entities of a certain label, and so on.
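For illustration, here's roughly what two binary annotations on the same text might look like in Prodigy's JSONL format. The hash values, labels and spans below are made up and the records are simplified, but the point is that the _input_hash (based only on the text) is identical, while the _task_hash (based on the text plus the suggested span) differs:

```
{"text": "Apple opened an office in Paris", "_input_hash": 1234, "_task_hash": 5678, "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"}
{"text": "Apple opened an office in Paris", "_input_hash": 1234, "_task_hash": 9012, "spans": [{"start": 26, "end": 31, "label": "GPE"}], "answer": "accept"}
```

When you train, both decisions are merged back onto the same text, so the model gets to see that characters 0–5 are an ORG and 26–31 are a GPE, while everything else stays unknown.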

That's also why I would usually recommend working with separate datasets for different types of annotations. The annotations you collect with ner.teach are binary decisions on different spans, whereas in ner.make-gold, you're labelling the whole sentence until it's complete. It's usually cleaner to keep those separate, and it also makes it easy to try out different things and run experiments with different configurations. For example: Does your model improve if you add manually labelled examples? What's the accuracy if you only train from manual labels? What if you add more examples with binary accept/reject decisions? All of that becomes difficult if you only have one dataset.
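For example, if you keep the binary and manual annotations in separate datasets, those experiments come down to a few commands. The dataset and model names below are just placeholders, and recent versions of Prodigy also ship with a db-merge command that lets you combine datasets without touching the originals:

```
# train only from the binary ner.teach annotations
prodigy ner.batch-train ner_teach_db en_core_web_sm --output /tmp/model-binary --eval-split 0.2

# combine binary and manual annotations into a new dataset and train from both
prodigy db-merge ner_teach_db,ner_gold_db ner_combined
prodigy ner.batch-train ner_combined en_core_web_sm --output /tmp/model-combined --eval-split 0.2
```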

There are essentially two different ways to interpret the annotations you've collected. None of this is "inherent" to the annotations themselves – it all comes down to how you interpret the data and how you train from it later on.

  • Partial or "sparse" annotations: Those are annotations like the ones you collect with ner.teach or other recipes that ask for binary feedback on pre-labelled spans. If you say "yes" to a highlighted span, we only know that this one is correct – but we know nothing about the rest of the tokens. Maybe there's another entity in there, maybe there isn't. So by default, Prodigy will assume that all unlabelled tokens are unknown when training the model.
  • Complete, "gold-standard" annotations: Those are annotations that describe the "perfect" and complete analysis of the given text. If only one entity is labelled, we know that all other tokens are not part of an entity. If there are no spans at all, we know that the text doesn't contain any entities. In the latest version of Prodigy, you can set the --no-missing flag on ner.batch-train to specify that all annotations you're training from should be considered complete and don't have missing or unknown labels (see the example commands after this list).
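For example, assuming you have a dataset ner_gold_db with complete ner.make-gold annotations and a dataset ner_teach_db with binary ner.teach decisions (the names are made up), the two interpretations roughly correspond to the following commands:

```
# binary annotations: unlabelled tokens are treated as unknown
prodigy ner.batch-train ner_teach_db en_core_web_sm --output /tmp/model-binary

# complete annotations: unlabelled tokens are treated as "not an entity"
prodigy ner.batch-train ner_gold_db en_core_web_sm --no-missing --output /tmp/model-gold
```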

How many labels did you annotate? Results like 99% are always suspicious – they often indicate that the model learned the wrong thing, or that you're evaluating on a set that includes your training examples. It's difficult to analyse what happened here without more details, but it's possible that some of the results you saw were a side-effect of your dataset containing a mix of annotation types, different labels and so on.

Once you get more serious about training and evaluating your models, it's always a good idea to create a dedicated evaluation set. This lets you run more reliable and repeatable experiments. You can use ner.make-gold or ner.manual to create a new dataset, stream in a sample of your data that you won't use for training, and label all entities. When you run ner.batch-train, you can then set --eval-id [dataset name]. Instead of just holding back a certain percentage of the training set, Prodigy will then evaluate against your evaluation set.
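Putting it all together, a sketch of that workflow could look like this – again, the dataset, file and label names are just placeholders:

```
# label a held-out sample completely to create the evaluation set
prodigy ner.make-gold ner_eval en_core_web_sm eval_sample.jsonl --label PERSON,ORG

# train from the training set and evaluate against the dedicated evaluation set
prodigy ner.batch-train ner_train en_core_web_sm --eval-id ner_eval --output /tmp/model
```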