Training/Evaluation dilemma


I’m a bit confused about the mechanics of how Prodigy handles the annotation database and batch training.

Basically, I collected about 1000 annotations on my dataset today, starting from a blank model, and batch-trained on them to get a trained model. The problem is: if I continue tomorrow and collect another 500 annotations, is Prodigy’s database constructed in a way that it recognises those 500 as a fresh set of annotations?

If it doesn’t, and I batch-train these fresh 500 annotations on top of yesterday’s trained model, will the result be biased (because the old 1000 annotations are being trained on again, i.e. double counting)?

Also, I’m getting much higher accuracy when I train the fresh set of annotations on top of the trained model than when I train a blank model every time I have new annotations. Is that high accuracy due to how the evaluation set is taken? (i.e. the eval set is taken from the old annotations, which the trained model from yesterday has already seen…)

Kindly advise, thanks!

If you have all the training data available, you usually want to start from a blank (or vectors) model, rather than using one you’ve previously trained, and training on top of it.

The mechanics of whether the data will be “double counted” are a little subtle, though. In theory you’ll probably converge to a similar solution, even if you do start with the previous model. To understand why this is the case, remember that the model you load in, which you previously trained on your first 1000 texts, is going to have pretty low training loss on those texts. If it’s getting those examples mostly right, then it won’t update much against them. The total magnitude of the updates you’re making will initially be dominated by the new texts.
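To make the point about update magnitudes concrete, here is a minimal sketch using a toy one-feature logistic regression rather than spaCy's actual NER model. The weights, the "old" examples and the "new" examples are all made up for illustration; the only claim is the general one that examples a model already fits well produce small gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_magnitude(weight, bias, x, y):
    """Absolute gradient of the log loss w.r.t. the weight for one example."""
    p = sigmoid(weight * x + bias)
    return abs((p - y) * x)

# Hypothetical classifier, assumed to be already fit on the "old" examples.
weight, bias = 4.0, 0.0

# (feature, label) pairs: the old ones are predicted confidently and correctly,
# the new ones are currently misclassified.
old_examples = [(2.0, 1), (-2.0, 0), (1.5, 1)]
new_examples = [(-1.0, 1), (1.0, 0)]

old_total = sum(grad_magnitude(weight, bias, x, y) for x, y in old_examples)
new_total = sum(grad_magnitude(weight, bias, x, y) for x, y in new_examples)

print(f"gradient mass from old examples: {old_total:.4f}")
print(f"gradient mass from new examples: {new_total:.4f}")
```

In this toy setting the new examples contribute far more to the update than the old ones, even though there are fewer of them.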

That said: it’s simply much more difficult to reason about the performance and training dynamics if you start from an intermediate state, rather than starting from the random initialisation. You also don’t save much time doing that, so there’s little advantage. The only time you want to update on top of existing weights is if you don’t have access to the initial training data (e.g. with spaCy models such as en_core_web_lg, which are trained on proprietary datasets we can’t distribute), or when the initial training took a very long time (e.g. with language model pretraining).

The other consideration is simple repeatability. If you’re always training on top of a prior model, then it’ll be really hard to start from scratch and reproduce the same result. You would have to first train on the 1000, save that model out, and then train on the full 1500. Being able to repeat your work is obviously good for sanity.

The other time you need to use a non-blank model is in commands like ner.teach and ner.make-gold. But notice that these recipes are designed to produce annotations, they’re not designed to produce models. It makes sense to start from a non-blank model if the goal is to do model-assisted annotation. But if you just want to output a model, you want to start from an initial condition that’s easy to reason about, so you don’t want to resume from an arbitrary model.

For future readers, @ines’s reply on a related thread might also be worth reading: Model Training for NER

Hi @honnibal, thanks for the prompt reply!

I’m raising this because I was training my 1500 annotations on a blank model and got around 30+% accuracy, which is extremely poor. I also tried turning on the --unsegmented flag, and got roughly the same accuracy.

On the other hand, another approach was to first train the blank model on my first 1000 annotations with the -U flag turned on, and after some ner.make-gold corrections, I got about 60% accuracy. Thereafter, I took the newly annotated 500 annotations and trained on top of this 60% model, and got an 85% accuracy model (which came as a pleasant surprise…).

So I was wondering if the high accuracy is flawed or biased, in the sense that when Prodigy did the 80/20 train/eval split, the eval set was actually part of the 1000 annotations that the model had seen before. I hope to clarify the discrepancy between these 2 approaches as well…


@jsnleong Yes, that’s actually another very good reason not to train in multiple steps like that. If you’re using the split evaluation rather than keeping a consistent evaluation set, then you’ll end up forming a different train/evaluation split each time you run ner.batch-train, which means the model you’re loading in may have seen some of the data you’re evaluating on.
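One way to keep the evaluation set consistent as the dataset grows is to decide membership from a hash of the text instead of a fresh random shuffle. The sketch below is not how Prodigy does its split; `is_eval_example` and `split_examples` are hypothetical helpers, and the examples are assumed to be dicts with a "text" key.

```python
import hashlib

def is_eval_example(text, eval_percent=20):
    """Assign a text to the evaluation set based on a hash of its content."""
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    return int(digest, 16) % 100 < eval_percent

def split_examples(examples, eval_percent=20):
    """Split examples into (train, evaluation) lists, stable across runs."""
    train, evaluation = [], []
    for eg in examples:
        target = evaluation if is_eval_example(eg["text"], eval_percent) else train
        target.append(eg)
    return train, evaluation

# Made-up stand-ins for the first 1000 and the later 500 annotations.
day_one = [{"text": f"first batch text {i}"} for i in range(1000)]
day_two = [{"text": f"second batch text {i}"} for i in range(500)]

train_1, eval_1 = split_examples(day_one)
train_2, eval_2 = split_examples(day_one + day_two)

# Texts trained on in the first round never show up in the later eval set.
eval_texts_2 = {eg["text"] for eg in eval_2}
leaked = [eg for eg in train_1 if eg["text"] in eval_texts_2]
print(len(leaked), "training examples leaked into the later evaluation set")
```

Because a text always lands on the same side of the split, a model trained on the first batch cannot have seen anything that is later used for evaluation. The held-out examples could then be saved as their own dataset and reused for every training run.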

So: the 85% accuracy is probably not a correct evaluation figure. It’s best to just train from a blank or vectors-only model each time. If you’re only getting 30% accuracy, I would check the following:

  • If you’re not using word vectors, consider trying that. You can start from the en_vectors_web_lg model, for instance.
  • You can try different hyper-parameters, e.g. changing the batch size and the dropout rate
  • You could check that your data is consistently annotated
  • You could try annotating more data
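For the consistency check in the list above, a rough first pass is to look for texts that were annotated more than once with different spans. This is only a sketch: `find_conflicts` is a hypothetical helper, and it assumes gold-style examples (for instance from ner.make-gold) stored as dicts with "text", "spans" and "answer" keys. It is not meant for binary ner.teach annotations, where one text legitimately appears with different single spans.

```python
from collections import defaultdict

def find_conflicts(examples):
    """Return texts that were accepted with more than one distinct set of spans."""
    by_text = defaultdict(set)
    for eg in examples:
        # Only compare accepted annotations; rejected or ignored ones are skipped.
        if eg.get("answer", "accept") != "accept":
            continue
        spans = frozenset(
            (span["start"], span["end"], span["label"]) for span in eg.get("spans", [])
        )
        by_text[eg["text"]].add(spans)
    return {text: variants for text, variants in by_text.items() if len(variants) > 1}

# Made-up examples: the first text is labelled two different ways.
examples = [
    {"text": "Apple hired Tim.", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim.", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "PRODUCT"}]},
    {"text": "Berlin is nice.", "answer": "accept",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

conflicts = find_conflicts(examples)
for text, variants in conflicts.items():
    print(text, "->", len(variants), "different annotations")
```

Exact duplicates are only the easiest case to detect; inconsistent decisions on similar but not identical texts still need a manual read-through.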

Ok thanks @honnibal.

Currently, I’m using the en_core_web_lg model, but replacing the ‘ner’ component with a blank one. Is that similar to using en_vectors_web_lg?

I tried installing that vectors model anyway and ran it on my annotations, and I’m getting a ValueError: [E030] Sentence boundaries unset.

However, when I use the --unsegmented argument, I’m able to get it running and get an accuracy of 59.7%, compared with ~43% when I’m using the en_core_web_lg model (blank ner).

What seems to be happening when I use -U?