Training/Evaluation dilemma

If you have all the training data available, you usually want to start from a blank (or vectors-only) model, rather than loading one you’ve previously trained and training on top of it.

The mechanics of whether the data will be “double counted” are a little subtle, though. In theory you’ll probably converge to a similar solution even if you do start from the previous model. To understand why, remember that the model you load in, the one previously trained on your first 1000 texts, is going to have pretty low training loss on those texts. If it’s getting those examples mostly right, it won’t update much against them, so the total magnitude of the updates you’re making will initially be dominated by the new texts.
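To make that intuition concrete, here’s a minimal sketch with a toy one-feature logistic regression in plain Python (this is an illustration, not spaCy’s actual model or update rule): after fitting the “old” examples, the gradient on them is tiny compared to the gradient on examples the model still gets wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_grad(w, b, examples):
    # Mean absolute gradient of the logistic loss over the examples.
    total = 0.0
    for x, y in examples:
        err = sigmoid(w * x + b) - y
        total += abs(err * x) + abs(err)
    return total / len(examples)

def train(w, b, examples, epochs=200, lr=0.5):
    # Plain per-example SGD on the logistic loss.
    for _ in range(epochs):
        for x, y in examples:
            err = sigmoid(w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# The "first 1000 texts": cleanly separable, so training loss gets low.
old_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
# "New texts" the fitted model still gets wrong.
new_data = [(0.5, 0), (-0.5, 1)]

w, b = train(0.0, 0.0, old_data)
g_old = mean_grad(w, b, old_data)
g_new = mean_grad(w, b, new_data)
# g_old is near zero, while g_new is much larger, so the new
# examples dominate the early updates when you resume training.
```

The same effect is why the “double counting” mostly washes out in practice: the old examples contribute almost nothing to the update magnitude at first.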

That said: it’s simply much more difficult to reason about performance and training dynamics if you start from an intermediate state rather than from a random initialisation. You also don’t save much time by doing that, so there’s little advantage. The only time you want to update on top of existing weights is when you don’t have access to the initial training data (e.g. with the en_core_web_lg etc. spaCy models, which are trained on proprietary datasets we can’t distribute), or when the initial training took a very long time (e.g. with language model pretraining).

The other consideration is simple repeatability. If you’re always training on top of a prior model, it’ll be really hard to start from scratch and reproduce the same result. You would have to first train on the initial 1000 examples, save that model out, and then train on top of it with the full 1500. Being able to repeat your work is obviously good for sanity.
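The repeatability point can be sketched with a hypothetical toy “training run” in plain Python (again, not spaCy’s API): a from-scratch run is a pure function of the data and the seed, while a chained run also depends on an intermediate checkpoint you’d have to re-create first.

```python
import random

def train(data, seed=0, init=0.0, epochs=50, lr=0.1):
    # Hypothetical one-weight model fit by noisy SGD; the noise stands in
    # for things like shuffling and dropout, made repeatable by the seed.
    rng = random.Random(seed)
    w = init
    for _ in range(epochs):
        for x, y in data:
            w -= lr * ((w * x - y) * x + rng.gauss(0.0, 0.01))
    return w

first_1000 = [(1.0, 2.0), (2.0, 4.0)]
full_1500 = first_1000 + [(3.0, 6.0)]

# From scratch: one command, and rerunning it reproduces the same model.
scratch_a = train(full_1500)
scratch_b = train(full_1500)

# Chained: reproducing this requires first re-creating the intermediate
# model, then resuming from it, i.e. two steps instead of one.
checkpoint = train(first_1000)
chained = train(full_1500, init=checkpoint)
```

Both routes end up near the same solution here, but only the from-scratch run can be reproduced with a single command.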

The other time you need to use a non-blank model is in recipes like ner.teach and ner.make-gold. But note that these recipes are designed to produce annotations, not models. It makes sense to start from a non-blank model if the goal is model-assisted annotation. But if you just want to output a model, you want to start from an initial condition that’s easy to reason about, so you shouldn’t resume from an arbitrary model.

For future readers, @ines’s reply on a related thread might also be worth reading: Model Training for NER