ner.train number of examples


I’m fairly new to spaCy and Prodigy. I have an NER model I’ve bootstrapped with patterns and a round or two of ner.teach.

When I finished with ner.teach, Prodigy listed around 2185 under “Total” in the progress panel. I assume this means I have 2185 annotations that I have marked as accept or reject.

Yet when I run batch-train, it only lists the following stats, and I’m having trouble understanding why not all of my annotations seem to be used to train the model.

Using 20% of accept/reject examples (69) for evaluation
Using 100% of remaining examples (292) for training
Dropout: 0.2  Batch size: 8  Iterations: 10

BEFORE     0.000
Correct    0
Incorrect  137
Entities   468
Unknown    0

Yes, that’s how it should be. The “total” is the total number of annotations in your dataset – it should also match the number of lines that are exported when you run the db-out command.

The training recipes will exclude the examples you’ve marked as “ignore”, but I don’t think that’s significant here (unless most of your annotations were ignored, which seems unlikely?)

Did your annotations include many entities on the same text? Since ner.teach is binary, each annotation decision you make is its own example – but before training, Prodigy will merge all span annotations on the same input into one training example. You can check this by re-running the command with the environment variable PRODIGY_LOGGING=basic. This should output a log statement like “MODEL: Merging entity spans of X examples”. Going from ~2000+ annotations down to 361 examples still seems like a big difference, though.

If this isn’t the explanation, is there anything else in the log that potentially looks suspicious? And can you double-check that the dataset you’re using in ner.batch-train is the correct one and actually exports 2185 annotations?
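To double-check the export, it can help to count how many distinct inputs sit behind the annotations. Below is a minimal sketch, assuming the db-out export is a JSONL file in which each record has an `_input_hash` and an `answer` key. The function name `summarize_annotations` and the file name in the usage comment are made up for illustration.

```python
import json
from collections import Counter


def summarize_annotations(lines):
    """Count records, distinct inputs and answers in an iterable of JSONL lines."""
    answers = Counter()
    input_hashes = set()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        total += 1
        answers[record.get("answer", "unknown")] += 1
        input_hashes.add(record.get("_input_hash"))
    return {
        "total": total,
        "unique_inputs": len(input_hashes),
        "answers": dict(answers),
    }


# Usage with a hypothetical export file:
# with open("annotations.jsonl", encoding="utf8") as f:
#     print(summarize_annotations(f))
```

If `unique_inputs` comes out much lower than `total`, many decisions were made on the same texts, which would be consistent with the smaller example count reported at training time.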



That helps a lot. It was just merging the examples due to multiple labels, plus a mistake where I ran teach on a small set of examples, which ended up represented multiple times in the db, I’m guessing.

Another question about workflow. Let’s say I bootstrapped the model with patterns, then ran batch-train, then did a few rounds of teach, then ran batch-train again.

Then I decided to go through and clean up the examples with make-gold. I’m curious how Prodigy/spaCy deal with the possible duplicate examples in the db. I did a make-gold of just one example, then looked at the db-out output and noticed the same input listed 3 times, with the same input_hash:

One had a single span reject, one had a single span accept (likely both from rounds of ner.teach), and then finally a fully accepted gold version with all the spans and tokens.

I assume I don’t have to go in and do any clean up and prodigy knows how to deal with these items when training?

Does this workflow make sense? (I did notice training on one of my labels at one point was near 99% accuracy according to batch-train, but now after more rounds of teach on different labels is down to 25% even when training just that label, so I’m a bit confused.)


When looking at the examples, it’s important to distinguish between the input hash and the task hash. The input hash is based on only the input data, e.g. the text. So you can easily have lots of annotations on the same input, but with different spans. Prodigy won’t replace examples in the dataset – your dataset should always be an exact record of each individual annotation decision. But when you run ner.batch-train, Prodigy will merge all spans on the same input and use the annotation decisions to generate a set of “constraints” for each text. For example: existing entities, spans that we know are not entities of a certain label etc.
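To make the idea of merging by input hash more concrete, here is a rough illustration. It is not Prodigy’s actual merge logic, just a sketch that assumes exported records with `_input_hash`, `text`, `answer` and `spans` keys, and a hypothetical helper name `group_spans_by_input`. It collects the accepted and rejected spans for each distinct input.

```python
from collections import defaultdict


def group_spans_by_input(records):
    """Collect accepted and rejected spans per input hash (illustration only)."""
    merged = defaultdict(lambda: {"text": None, "accepted": [], "rejected": []})
    for record in records:
        answer = record.get("answer")
        if answer not in ("accept", "reject"):
            continue  # ignored examples carry no usable signal
        entry = merged[record["_input_hash"]]
        entry["text"] = record.get("text")
        key = "accepted" if answer == "accept" else "rejected"
        for span in record.get("spans", []):
            entry[key].append((span["start"], span["end"], span["label"]))
    return dict(merged)
```

Each key of the result corresponds to one text. The accepted spans are known entities, and the rejected spans are known not to be entities with that label, which is roughly what the “constraints” describe.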

That’s also why I would usually recommend working with separate datasets for different annotations. The annotations you collect with ner.teach are binary decisions on different spans, whereas in ner.make-gold, you’re labelling the whole sentence until it’s complete. It’s usually cleaner to keep those separate, and it also makes it easy to try out different things and run experiments with different configurations. For example: Does your model improve if you add manually labelled examples? What’s the accuracy if you only train from manual labels? What if you add more examples with binary accept/reject decisions? All of that becomes difficult if you only have one dataset.

There are mostly two different ways to interpret the annotations you’ve collected. None of this is “inherent” to the annotations themselves – it all comes down to how you interpret the data and how you train from it later on.

  • Partial or “sparse” annotations: Those are annotations like the ones you collect with ner.teach or other recipes that ask for binary feedback on pre-labelled spans. If you say “yes” to a highlighted span, we only know that this one is correct – but we know nothing about the rest of the tokens. Maybe there’s another entity in there, maybe there isn’t. So by default, Prodigy will assume that all unlabelled tokens are unknown when training the model.
  • Complete, “gold-standard” annotations: Those are annotations that describe the “perfect” and complete analysis of the given text. If only one entity is labelled, we know that all other tokens are not part of an entity. If there are no spans at all, we know that the text doesn’t have any entities. In the latest version of Prodigy, you can set the --no-missing flag on ner.batch-train to specify that all annotations you’re training from should be considered complete and don’t have missing or unknown labels.

How many labels did you annotate? Results like 99% are always suspicious – it often indicates that the model learned the wrong thing, or that you’re evaluating on a set that includes your training examples. It’s difficult to analyse what happened here without more details, but it’s possible that some of the results you saw were a side-effect of your dataset containing a mix of annotation types and different labels etc.

Once you get more serious about training and evaluating your models, it’s always a good idea to create a dedicated evaluation set. This lets you run more reliable and repeatable experiments. You can either use ner.make-gold or ner.manual, create a new dataset, stream in a sample of your data that you won’t use for training and label all entities. When you run ner.batch-train, you can then set --eval-id [dataset name]. Instead of just holding back a certain percentage from the training set, Prodigy will now evaluate against your evaluation set.
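One way to set aside a sample of raw texts for the evaluation set is to split the source data before annotating anything. This is a minimal sketch using only the standard library. The helper name `split_holdout`, the 20% fraction and the fixed seed are assumptions for illustration, not Prodigy settings.

```python
import random


def split_holdout(texts, eval_fraction=0.2, seed=0):
    """Shuffle texts reproducibly and split off an evaluation sample."""
    texts = list(texts)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(texts)
    n_eval = int(len(texts) * eval_fraction)
    # First part is for training annotation, second part for evaluation only
    return texts[n_eval:], texts[:n_eval]
```

The evaluation texts can then be written to their own file and annotated into a separate dataset, so they never show up in the training data.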

Re: the 99%

This was a customized date label that a regex could probably handle pretty well, so I think it’s expected that the model would do so well. That’s why I was so confused when it dropped. Maybe it’s due to having added a lot of ner.teach examples on a different label: my training set then had a lot of examples where the dates were not labeled, even though the model did a great job finding them. The evaluation slice would not have dates labeled either, which might cause the % to drop, since there are no “accept” examples for dates in that part of the dataset.

Re: teach vs. make-gold.

Am I understanding you correctly that I should work with one dataset for the binary ner.teach annotations, a different dataset using make-gold, and possibly a third with manual or make-gold for evaluation, whose examples aren’t used elsewhere? What about the first two, ner.teach and ner.make-gold: should those also be disjoint sets of examples?

Also when training with multiple sets, do we just do 2 rounds of batch-train? Can I batch-train on two datasets at once?

Thanks for the prompt replies!

Yes, I would at least recommend to use different sets for sparse annotations (binary) and gold-standard annotations, just because it’ll make it easier to use different training strategies later on and keep an overview. And yes, the evaluation set should always be separate. I usually name the evaluation sets something like project_name_eval.

You’d currently have to export the datasets and merge them yourself if you want to train from more than one set at once (or write a script for that). But being able to set more than one dataset ID on the command-line would definitely be a nice feature in the future.
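A merge script along those lines could look roughly like this. It is a sketch under a few assumptions: the exports are JSONL files produced by db-out, each record may carry a `_task_hash` and an `answer`, and a record with the same task hash and answer as an earlier one is treated as a repeat and skipped. The helper names `merge_exports` and `to_jsonl` and the file names in the usage comment are made up.

```python
import json


def merge_exports(*exports):
    """Merge several iterables of JSONL lines, skipping repeated annotations."""
    seen = set()
    merged = []
    for lines in exports:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            task_hash = record.get("_task_hash")
            if task_hash is not None:
                key = (task_hash, record.get("answer"))
                if key in seen:
                    continue  # same task with the same answer already kept
                seen.add(key)
            merged.append(record)
    return merged


def to_jsonl(records):
    """Serialize records back to JSONL text, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


# Usage with hypothetical file names:
# with open("teach.jsonl", encoding="utf8") as a, open("gold.jsonl", encoding="utf8") as b:
#     merged = merge_exports(a, b)
# with open("merged.jsonl", "w", encoding="utf8") as out:
#     out.write(to_jsonl(merged))
```

The merged file can then be imported into a new dataset with db-in and used for training.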

When training new labels, try to always include some examples of (other) entities that the model previously got right. This helps prevent the “catastrophic forgetting” problem where the model overfits on the new data and “forgets” what it previously knew.

The ner.make-gold recipe can be useful here, because it shows you the model’s predictions for the given labels. So you could try a workflow like this: bootstrap a new label with patterns and ner.teach → train and add a new label to the model → load trained model into ner.make-gold with several labels (new one and existing ones) → correct predictions if necessary → train again with --no-missing.

Regarding the workflow for adding a new label…

The model I’m aiming for actually doesn’t need any of the old entities in en_core_web_*. When I batch-train, should I be training on top of that model, or on a new blank model with a blank EntityRecognizer with only my labels? From initial experiments, the “Accuracy” listed in batch-train is higher when starting with a blank model. Also, when you say to train again with the gold dataset, I should be training “from scratch” from the base model, right? Or should I be training on top of the trained model I used for make-gold?

Speaking of batch-train accuracy. I ran an experiment where I took my dataset for the three entities/labels I’m training and split them into 3 datasets, one for each label.

When doing that I get the following:
Accuracy 0.814
Accuracy 0.921
Accuracy 0.980

But for the model that’s trained with all 3 labels at once:
Accuracy 0.552

I feel like I’m missing a concept that would explain this, is this the catastrophic forgetting overfit?

Probably a blank model. You might try en_vectors_web_lg. This gives you the pre-trained vectors, but not the model. You can also try outputting an entirely blank model, with something like python -c "import spacy; spacy.blank('en').to_disk('/tmp/en-blank')".

We’re hoping to improve the workflow around this; for now it’s a bit unsatisfying. The software doesn’t fully support having a dataset that’s a mix of the sparsely annotated yes/no type of feedback with another dataset that has fully specified “dense” annotations. I know there’s a mix of tricky concepts here, so to recap:

  • Binary yes/no decisions produce “sparse”, aka “incomplete” annotations, aka some true entities may be missing. We know some span is or is not an entity, but if the model predicts some other entity, we don’t necessarily know whether the model is right or wrong.

  • The manual interface produces “dense”, aka “complete”, aka “no missing” annotations. After you’ve reviewed and selected the text, we can assume that if you didn’t highlight some entity, and the model predicts it, it must be wrong.

As of v1.5.1, if you have a dataset that’s made from a mix of ner.teach and ner.manual (or ner.make-gold), there’s no way to tell it that some of the annotations are complete and others aren’t. You could train on the incomplete annotations, and then train again on top of the output model, as you suggest.

As an alternative, you could instead try to “upgrade” the ner.teach examples, into complete annotations. To do this, run the ner.print-best recipe, which will output the most likely parse compatible with the existing annotations. You can then pipe the output of this into the ner.make-gold recipe, so you can correct the errors and fill in any missing entities.

It wouldn’t be a “catastrophic forgetting” sort of problem, no.

There are a few possible explanations. It could be that the problem is simply harder initially with the three labels, so the model isn’t optimising well, and the hyper-parameters need some fiddling. Some questions:

  • How many examples do you have?
  • Did you annotate the same examples, or different ones?

I only just saw this post. I didn’t use different datasets, so I have both ner.teach and ner.make-gold annotations in my one dataset.

Would it make sense to try this, considering I would have to catch up on about 2,000 or 3,000 texts I annotated with ner.teach? If so, how exactly do I pipe the output of ner.print-best to ner.make-gold?

I tried this, which seems to be wrong since I get an error (ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.):

python3.5 -m prodigy ner.print-best my_dataset /tmp/blank-de | python3.5 -m prodigy ner.make-gold fully_annotated_dataset my_model --label MEDIKAMENT

I created the dataset fully_annotated_dataset beforehand. I can use my already trained model my_model for the ner.make-gold, right? In the ner.make-gold, do I get the error because I didn’t give a source? I thought this ner.print-best would pipe the output directly to the ner.make-gold.
