Best strategy for training an NER engine

Thanks for your detailed questions and sharing your use case!

No, the active learning component is actually part of the teach process: it uses the model you load and keep in the loop, which is then updated with your annotations. The model in the loop learns and improves as you annotate – and when you're done, you can create an even more optimised version of the model you were training by running `ner.batch-train` with more iterations.
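To make the shape of that loop concrete, here's a minimal sketch of the uncertainty-sampling idea behind a model-in-the-loop recipe. This is not Prodigy's internals – the scorer and update rule are toy stand-ins – but it shows how the model's confidence decides what you're asked about, and how your answers update the model as you go:

```python
# Illustrative sketch only: a toy scorer and update rule standing in
# for the spaCy model that Prodigy keeps in the loop.

def score(model, example):
    # Toy confidence: fraction of tokens the "model" has seen before.
    tokens = example.split()
    return sum(1 for t in tokens if t in model) / len(tokens)

def update(model, example, accept):
    # Toy update: remember tokens from accepted examples.
    if accept:
        model.update(example.split())

stream = ["Apple hires new CEO", "Apple hires new CFO", "Rain expected today"]
model = set()     # stands in for the model you load and keep updating
annotations = []

for example in stream:
    if score(model, example) < 0.75:  # only ask about uncertain examples
        accept = "Apple" in example   # stands in for the human decision
        annotations.append((example, accept))
        update(model, example, accept)

# "Apple hires new CFO" is skipped: after the first answer, the model
# is already confident about it, so the annotator never sees it.
print(len(annotations))
```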

I think the problem you're experiencing happens because you start off with a "fresh" model (e.g. the default `en_core_web_sm`) every time you start the Prodigy server. So on each annotation run, you start with a model that hasn't learned anything yet – which is why it keeps asking the same questions. When you start Prodigy and add to an existing dataset, the existing annotations are not used to pre-train the model. Prodigy only creates a unique hash for each input example and each annotated example, to make sure you're not annotating the exact same example twice in the same dataset.
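The hashing idea can be sketched like this – note that Prodigy's actual scheme differs in its details, but the principle is the same: one stable hash over the raw input, and a second hash over the whole task, so the same text can still come back with a different label or span proposal:

```python
import hashlib
import json

# Illustrative sketch of input/task hashing for deduplication
# (not Prodigy's exact implementation).

def input_hash(example):
    # Hash only the raw input text.
    return hashlib.sha1(example["text"].encode("utf8")).hexdigest()

def task_hash(example):
    # Hash the input plus the proposed annotation (label, span, etc.).
    payload = json.dumps(example, sort_keys=True)
    return hashlib.sha1(payload.encode("utf8")).hexdigest()

stream = [
    {"text": "Apple hires new CEO", "label": "ORG"},
    {"text": "Apple hires new CEO", "label": "ORG"},     # exact duplicate
    {"text": "Apple hires new CEO", "label": "PERSON"},  # same text, new task
]

seen = set()
unique = []
for eg in stream:
    h = task_hash(eg)
    if h not in seen:
        seen.add(h)
        unique.append(eg)

# The exact duplicate is dropped; the same text with a different
# label is a new task and is kept.
print(len(unique))
```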

We did consider pre-training in `ner.teach` in an early version of Prodigy, but decided against this feature, because it could easily lead to weird and unexpected results. Instead, `ner.batch-train` lets you create trained artifacts of the model, using whichever configuration you like. So for your use case, the steps could look like this:

  1. Start off with a "fresh" model and collect annotations for your dataset using `ner.teach`.
  2. Run `ner.batch-train` to train a model and ensure that it's learning what it's supposed to (it's actually really nice to have this in-between step – if it turns out your data is not suitable, or something else is wrong, you'll find out immediately and can make adjustments).
  3. Load the previously trained model into `ner.teach`, e.g. `ner.teach my_set /path/to/model`, and annotate more examples for the same dataset. You don't need to package your model as a Python package – the model you load in can also be a path.
  4. Train a model again – starting with the "blank" base model, not your previously trained model! – and see if the result improves. Starting with a blank model that hasn't been trained on your examples is important, because you always want a clean state. (You also don't want to end up evaluating your model on examples it has already seen during training.)
  5. Repeat until you're happy with the results.
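On the command line, the steps above could look roughly like this – the dataset name, file paths and label are placeholders, and the exact flags may differ slightly depending on your Prodigy version, so do check them against the recipe docs:

```shell
# 1) Annotate with a fresh model
prodigy ner.teach my_set en_core_web_sm data.jsonl --label ORG

# 2) Train an artifact from the collected annotations
prodigy ner.batch-train my_set en_core_web_sm --output /tmp/model --n-iter 10

# 3) Annotate more examples for the same dataset, loading the model by path
prodigy ner.teach my_set /tmp/model data.jsonl --label ORG

# 4) Train again -- starting from the base model, not /tmp/model
prodigy ner.batch-train my_set en_core_web_sm --output /tmp/model_v2 --n-iter 10
```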

Using the same dataset to store all annotations related to the same project is definitely a good strategy. It means that every time you train and evaluate your model, it will train and evaluate on the whole set, not just the annotations you collected in the last step. This also helps prevent the "catastrophic forgetting" problem. If you keep updating a previously trained model but only evaluate on the latest collected set, the results may look great on each training run. But you have no way of knowing whether your model still performs well on all examples – or whether it "forgot" everything it previously learned, and now only performs well on the latest set of annotations.
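You can see why the full-set evaluation matters with a deliberately extreme toy example – here the "model" simply memorises whatever batch it was last updated on, a caricature of a network overfitting to recent data:

```python
# Toy illustration of why evaluating only on the latest batch hides
# catastrophic forgetting. Batch contents are made up for the example.

batches = [
    {"Apple": "ORG", "Google": "ORG"},   # round 1 annotations
    {"Paris": "GPE", "Berlin": "GPE"},   # round 2 annotations
]

def accuracy(model, dataset):
    correct = sum(1 for text, label in dataset.items()
                  if model.get(text) == label)
    return correct / len(dataset)

model = {}
for batch in batches:
    model = dict(batch)  # an "update" that forgets everything else

latest = batches[-1]
full = {k: v for batch in batches for k, v in batch.items()}

print(accuracy(model, latest))  # looks perfect on the latest batch
print(accuracy(model, full))    # the full set reveals what was forgotten
```

Evaluating on the whole dataset each time is what exposes the gap between the two numbers.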
