NER workflow / database questions

I'm working on creating a named entity model for one label. I've been doing this with a combination of ner.manual / ner.teach and external annotations, which I imported with db-in.

These steps do produce a result, which is great, but also left me with some questions on the optimal workflow:

  • I'm saving the results of both ner.manual and ner.teach to the same dataset. Is this the preferred option? Or should I split the training files for binary (ner.teach) and manual annotations?
  • If I use the train recipe without the --binary option it shows precision/recall/fscore. With the --binary option it shows the accuracy. What is the reason for this difference?
  • On which annotated items is the model training? Binary and non-binary training each train on a different number of records, and neither number seems to correspond to 100% or 80% of the actual number of items annotated with ner.manual or ner.teach.
  • If I import external data, I use the jsonl format as described in the ner-manual annotation interface (with the addition of "answer": accept/reject). Should I also indicate that this is a binary annotation, and if so, how? And is it a problem that this format does not match the current format of the exported db-out file?
  • In the exported db-out file, per token we have a "ws" indicator that can be either True or False. What is the meaning of this indicator?

It would be great to get some insights from this forum with regards to these questions!

Hi! Ideally, you should have separate datasets for the different kinds of annotation, such as binary and manual. You typically want to use them differently during training and update your model differently, depending on whether you have sparse yes/no feedback on individual spans or complete annotations where you know the correct answer for each token.

I've explained this in more detail on this thread, which probably answers some of the other follow-up questions as well:

This is mostly because we can use the regular evaluation metrics returned by spaCy for the non-binary training, whereas the binary training requires a different evaluation and for that, the recipe currently only outputs one score.
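
To make the difference between the two kinds of score concrete, here is a minimal sketch in plain Python. It is not Prodigy's or spaCy's evaluation code; the function names and the span tuples are assumptions made for illustration. With complete annotations you can compare predicted spans against gold spans and get precision, recall and F-score. With binary annotations you only know whether individual suggestions were accepted or rejected, so a single accuracy-style score is the natural summary.

```python
def precision_recall_f(gold_spans, pred_spans):
    """Span-level scores, possible when the gold annotation is complete."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def binary_accuracy(decisions):
    """Fraction of yes/no decisions where the model agrees with the annotator.

    `decisions` is a list of (answer, model_accepts) pairs, where `answer`
    is "accept" or "reject" and `model_accepts` is a bool.
    """
    if not decisions:
        return 0.0
    correct = sum(
        1 for answer, model_accepts in decisions
        if (answer == "accept") == model_accepts
    )
    return correct / len(decisions)


# Hypothetical spans as (start, end, label) tuples.
gold = [(0, 8, "LOW_IQ"), (10, 12, "LOW_IQ")]
pred = [(0, 8, "LOW_IQ"), (20, 22, "LOW_IQ")]
scores = precision_recall_f(gold, pred)

decisions = [("accept", True), ("reject", True), ("reject", False), ("accept", False)]
accuracy = binary_accuracy(decisions)
```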

If you train with --binary: on all of them. If you train without: only on the examples that were accepted, and by default, they're assumed to be gold-standard with no missing values. That's also why mixing binary annotations and complete manual annotations in one dataset can be problematic: for binary annotations, you want to consider both accepted and rejected answers and consider all unannotated tokens missing values. For manual annotations, you typically want to assume that all tokens are annotated and that unannotated tokens are not entities – this will give you better and more reliable accuracy, because you know more "correct answers".

So if you mix them together, you're either disregarding what you know about the unannotated tokens in the manual data and treating them as missing values, or you're discarding rejected binary information and are treating accepted binary annotations as complete annotations (which would be incorrect).

What the annotations you collect "mean" is up to you and currently only decided when you train from them – for NER, the main decisions here are: 1) Should rejected spans be considered and used to update the model somehow? 2) What do unannotated tokens mean, are they not part of an entity, or is their label unknown?

So the underlying data would look the same: you have a "text", "spans" and an "answer" ("accept" or "reject"). And when you train from that data, you decide how to interpret it. That's also why you should use a separate dataset for the binary annotations.
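
As a rough illustration of "same data, different interpretation", here is a minimal sketch in plain Python. The records, the LOW_IQ label and the helper names are hypothetical, and the selection logic only mirrors the description above (binary training uses accepted and rejected examples, non-binary training uses only accepted ones); it is not the train recipe's actual code.

```python
import json

# Hypothetical records with the three keys mentioned above.
records = [
    {
        "text": "iq is 62",
        "spans": [{"start": 0, "end": 8, "label": "LOW_IQ"}],
        "answer": "accept",
    },
    {
        "text": "iq is above average",
        "spans": [{"start": 0, "end": 19, "label": "LOW_IQ"}],
        "answer": "reject",
    },
]


def select_for_training(examples, binary):
    """Pick the examples a training run would look at.

    binary=True:  keep accepted and rejected examples; unannotated tokens
                  are treated as missing values.
    binary=False: keep only accepted examples; they are assumed to be
                  complete, gold-standard annotations.
    """
    if binary:
        return [eg for eg in examples if eg["answer"] in ("accept", "reject")]
    return [eg for eg in examples if eg["answer"] == "accept"]


def to_jsonl(examples):
    """Serialise examples as one JSON object per line, e.g. for db-in."""
    return "\n".join(json.dumps(eg) for eg in examples)


binary_examples = select_for_training(records, binary=True)
gold_examples = select_for_training(records, binary=False)
```

This also explains why the number of training records differs between the two modes: the same dataset yields a different subset depending on how it is interpreted.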

Yes, that's a new setting introduced in v1.10. It indicates whether a token is followed by a whitespace character or not (like spaCy's Token.whitespace_ attribute). This information allows tokens to be rendered while preserving the original whitespace. Here's an example that shows this in action for wordpiece tokens:
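
A minimal sketch of the idea in plain Python, using ordinary word tokens rather than wordpieces. It assumes each token is a dict with "text" and "ws" keys; the `render` helper and the sample tokens are made up for illustration.

```python
def render(tokens):
    """Rebuild the original text: add a space only after tokens with ws=True."""
    return "".join(tok["text"] + (" " if tok["ws"] else "") for tok in tokens)


# Hypothetical tokens for the text "IQ is 62."
tokens = [
    {"text": "IQ", "ws": True},
    {"text": "is", "ws": True},
    {"text": "62", "ws": False},  # no space before the full stop
    {"text": ".", "ws": False},
]

rendered = render(tokens)
```

Without the flag, a renderer would have to guess where the spaces go, for example by joining every token with a space and producing "IQ is 62 ." instead.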

Thank you very much for your quick and clear response Ines.

Based on your answer I restarted my annotation with ner.manual. The 'entity' I'm looking for in my training texts (in Dutch) is 'below average iq'.

In the manual training I annotated texts like 'iq is 62' and many instances of 'iq is lower than average' (in slightly different forms, with typos etc.). In total I annotated about 1,000 texts.

As a next step I used ner.teach (writing to a new dataset) in order to get a better feel for what the trained model is doing. While doing this I noticed that the model is returning sentences like 'about average iq', and even 'iq is above average', with a 1.0 score. This obviously is the opposite of what I'm looking for.

I understand that the contexts of 'above average iq' and 'below average iq' look quite similar to a model, but I'm wondering if you could offer any suggestions on how to deal with qualifiers such as higher / lower in the task I'm describing.


Hi Roel,

I think the way you're annotating is probably setting up the problem to take much more data than an alternative approach might. You're using the NER model, which tries to learn the exact token boundaries. So you're teaching it something very specific about something that's only incidental to your concerns, which means the model will require many more examples. It will probably need to see a different example for every phrasing, and it's unlikely to generalise very well to the meaning in the way you're interested in.

I would suggest you switch to a sentence classification approach. If you have a rule-based preprocessing step that only passes on sentences that mention "IQ", "intelligence" and a few other keywords, you can use a sentence classifier to decide whether the sentence says the IQ is above or below average. If you have many sentences with multiple IQs listed you could deal with those separately, but I'm guessing your data will usually only have one per sentence?
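
The rule-based preprocessing step could look roughly like the sketch below. It is plain Python with a deliberately naive regex sentence splitter; the keyword list and function names are assumptions, and in practice you would probably use a proper sentence segmenter and Dutch keywords.

```python
import re

# Hypothetical keyword list; extend it with the terms used in your data.
KEYWORDS = ("iq", "intelligence")

KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_sentences(text):
    """Keep only the sentences that mention one of the keywords."""
    return [s for s in split_sentences(text) if KEYWORD_PATTERN.search(s)]


text = "The patient arrived late. IQ is lower than average. No other remarks."
candidates = candidate_sentences(text)
```

Only the sentences in `candidates` would then be annotated and passed to the sentence classifier, so the classifier never has to learn what an IQ mention looks like, only whether it is above or below average.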

Thank you Honnibal. We'll give sentence classification a go!

An alternative might be to train for the concept of 'IQ' and then use spaCy in a separate step to discover the qualifier.
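
A very rough sketch of that two-step idea, written in plain Python as a stand-in for the spaCy step: find the 'IQ' mention first, then look at a small window of surrounding tokens for a qualifier. The word lists, the window size and the function name are all assumptions for illustration, and the English words would need Dutch equivalents.

```python
# Hypothetical qualifier word lists.
LOWER = {"below", "lower", "under"}
HIGHER = {"above", "higher", "over"}


def iq_qualifier(sentence, window=4):
    """Return "below", "above" or None for the first qualified IQ mention.

    Looks `window` tokens to either side of each "iq" token. If both kinds
    of qualifier appear in the window, "below" wins, which is an arbitrary
    choice in this sketch.
    """
    tokens = [tok.strip(".,;:") for tok in sentence.lower().split()]
    for i, tok in enumerate(tokens):
        if tok != "iq":
            continue
        nearby = set(tokens[max(0, i - window): i + window + 1])
        if nearby & LOWER:
            return "below"
        if nearby & HIGHER:
            return "above"
    return None


results = [
    iq_qualifier("iq is above average"),
    iq_qualifier("below average iq"),
    iq_qualifier("iq is 62"),
]
```

Numeric mentions like 'iq is 62' come back as None here and would need a separate rule that compares the number against a threshold.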