Advice on training NER models with new entities

Yes, that's correct!

(One quick note on vectors: If you do end up with good vectors for your domain, using them in the base model can sometimes improve accuracy. If you're training a spaCy model and vectors are available in the model, they'll be used during training.)
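As a rough sketch of that vectors workflow on the command line (spaCy v2-era syntax; `custom_vectors.txt.gz`, `./base_model` and `my_dataset` are all placeholder names, and recipe names vary across Prodigy versions):

```shell
# Package custom vectors into a base model, then train from that model --
# the vectors in the base model are used automatically during training.
spacy init-model en ./base_model --vectors-loc custom_vectors.txt.gz
prodigy ner.batch-train my_dataset ./base_model --output ./trained_model
```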

Yes, that's correct. ner.make-gold can only pre-highlight entities that are predicted by the model, so this only works if the model already knows the entity type. If some of your entity types are already present in the model and others aren't, you could also combine the two recipes: start by annotating the existing labels with ner.make-gold, export the data, load it into ner.manual and add the new labels on top. How you do that depends on what's most efficient for your use case.
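One way to sketch that combined workflow on the command line (dataset names, file names and labels here are all placeholders, not part of the original question):

```shell
# 1) Pre-highlight the labels the model already knows
prodigy ner.make-gold existing_ents en_core_web_sm data.jsonl --label PERSON,ORG
# 2) Export the corrected annotations to a file
prodigy db-out existing_ents > annotated.jsonl
# 3) Re-load them in manual mode and add the new label on top
prodigy ner.manual combined_ents en_core_web_sm annotated.jsonl --label PERSON,ORG,NEW_LABEL
```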

Training and evaluation examples should ideally be drawn from the same data source, yes. The examples should also be representative of what your model will see at runtime – for example, if you're processing short paragraphs at runtime, you also want to evaluate the model on short paragraphs (and not, say, only short sentences). Also double-check that there's no overlap between the training and evaluation examples – even a few overlapping examples can noticeably distort the results. An evaluation set of 20-50% of the number of training examples is usually a good amount – and if you have under a thousand evaluation examples, you might have to take the evaluation results with a grain of salt.
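A minimal sketch of the overlap check and split described above (the function name and the `"text"` key are just illustrative assumptions, one dict per example):

```python
import random

def split_train_eval(examples, eval_fraction=0.3, seed=0):
    """Deduplicate by text, shuffle, and split so that no example
    appears in both the training and the evaluation set."""
    seen = set()
    unique = []
    for eg in examples:
        if eg["text"] not in seen:
            seen.add(eg["text"])
            unique.append(eg)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = int(len(unique) * eval_fraction)
    return unique[n_eval:], unique[:n_eval]

# 11 raw examples, one duplicate -> 10 unique, split 7 / 3
examples = [{"text": f"example {i}"} for i in range(10)] + [{"text": "example 0"}]
train, dev = split_train_eval(examples)
assert not {eg["text"] for eg in train} & {eg["text"] for eg in dev}
```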