NER workflow / database questions

Hi! Ideally, you should keep the different types of annotations, like binary and manual annotations, in separate datasets. You typically want to use them differently during training and update your model differently, depending on whether you have sparse yes/no feedback on individual spans or complete annotations where you know the correct answer for every token.
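If it helps, here's a minimal sketch of working with two separate datasets via Prodigy's Python database API. The dataset names `ner_binary` and `ner_gold` are just placeholders for this example:

```python
from collections import Counter
from prodigy.components.db import connect

# Connect to the Prodigy database (uses your prodigy.json settings)
db = connect()

# Two separate datasets: one with binary accept/reject feedback,
# one with complete manual ("gold") annotations
binary_examples = db.get_dataset("ner_binary")
gold_examples = db.get_dataset("ner_gold")

# Binary data: the answer on each suggested span matters
print(Counter(eg["answer"] for eg in binary_examples))

# Manual data: accepted examples are treated as complete annotations
gold_accepted = [eg for eg in gold_examples if eg["answer"] == "accept"]
print(f"{len(gold_accepted)} complete examples for training")
```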

I've explained this in more detail on this thread, which probably answers some of the other follow-up questions as well:

This is mostly because the non-binary training can use the regular evaluation metrics returned by spaCy, whereas the binary training requires a different kind of evaluation, and for that the recipe currently only outputs a single score.

If you train with --binary, the model is updated on all of them. If you train without, it's updated only on the examples that were accepted, and by default, those are assumed to be gold-standard with no missing values. That's also why mixing binary annotations and complete manual annotations in one dataset can be problematic: for binary annotations, you want to take both accepted and rejected answers into account and treat all unannotated tokens as missing values. For manual annotations, you typically want to assume that all tokens are annotated and that unannotated tokens are not entities – this gives you better and more reliable accuracy, because you know more "correct answers".
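To make the difference concrete, here's a rough sketch of how the same annotated example could be interpreted per token under the two assumptions. It uses spaCy's convention of "O" for tokens known not to be part of an entity and "-" for tokens whose label is unknown (missing); the text and span are made up:

```python
# Made-up example: tokens and one accepted span over "Berlin"
tokens = ["I", "live", "in", "Berlin", "."]
annotated = {3: "U-GPE"}  # token index -> BILUO tag from the accepted span

# Interpretation 1: binary / sparse annotations
# Unannotated tokens are treated as unknown ("-" = missing value)
binary_tags = [annotated.get(i, "-") for i in range(len(tokens))]
print(binary_tags)  # ['-', '-', '-', 'U-GPE', '-']

# Interpretation 2: complete manual annotations
# Unannotated tokens are treated as known non-entities ("O")
gold_tags = [annotated.get(i, "O") for i in range(len(tokens))]
print(gold_tags)  # ['O', 'O', 'O', 'U-GPE', 'O']
```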

So if you mix them together, you're either disregarding what you know about the unannotated tokens in the manual data and treating them as missing values, or you're discarding the rejected binary annotations and treating the accepted ones as complete annotations (which would be incorrect).

What the annotations you collect "mean" is up to you and is currently only decided when you train from them – for NER, the main decisions are: 1) Should rejected spans be considered and used to update the model somehow? 2) What do unannotated tokens mean: are they not part of an entity, or is their label unknown?

So the underlying data would look the same: you have a "text", "spans" and an "answer" ("accept" or "reject"). And when you train from that data, you decide how to interpret it. That's also why you should use a separate dataset for the binary annotations.
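For illustration, a stored task could look roughly like this (the text, offsets and label are made up; the general structure with "text", "spans" and "answer" is what gets saved):

```python
# A made-up annotation task as it might be stored in the dataset
example = {
    "text": "I live in Berlin.",
    "spans": [{"start": 10, "end": 16, "label": "GPE"}],
    "answer": "accept",  # or "reject" for binary feedback on the suggested span
}
# Whether the unannotated tokens count as "O" or as missing values
# is decided later, when you train from this data.
```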

Yes, that's a new setting introduced in v1.10. It indicates whether a token is followed by a whitespace character or not (like spaCy's Token.whitespace_ attribute). This information allows tokens to be rendered while preserving the original whitespace. Here's an example that shows this in action for wordpiece tokens: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP
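For example, pre-tokenized input could look something like this – assuming the per-token key for the setting is "ws", which is my reading of the v1.10 release notes:

```python
# Made-up pre-tokenized input; "ws" indicates whether a token is
# followed by a whitespace character in the original text
task = {
    "text": "I can't wait.",
    "tokens": [
        {"text": "I", "id": 0, "start": 0, "end": 1, "ws": True},
        {"text": "ca", "id": 1, "start": 2, "end": 4, "ws": False},
        {"text": "n't", "id": 2, "start": 4, "end": 7, "ws": True},
        {"text": "wait", "id": 3, "start": 8, "end": 12, "ws": False},
        {"text": ".", "id": 4, "start": 12, "end": 13, "ws": False},
    ],
}
# Tokens with "ws": False are rendered without a trailing space,
# so the original text "I can't wait." is reconstructed faithfully in the UI.
```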