Best strategy for training an NER engine

Thanks for your detailed questions and sharing your use case!

No, the active learning component is actually part of the teach process: it uses the model you load and keep in the loop, which is then updated with your annotations. The model in the loop learns and improves as you annotate – and when you're done, you can create an even more optimised version of the model you were training by running `ner.batch-train` with more iterations.
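To make the shape of that loop concrete, here's a minimal sketch of the uncertainty-sampling idea behind a model-in-the-loop recipe. This is not Prodigy's internals – the scorer and update rule are toy stand-ins – but it shows how the model's confidence decides what you're asked about, and how your answers update the model as you go:

```python
# Illustrative sketch only: a toy scorer and update rule standing in
# for the spaCy model that Prodigy keeps in the loop.

def score(model, example):
    # Toy confidence: fraction of tokens the "model" has seen before.
    tokens = example.split()
    return sum(1 for t in tokens if t in model) / len(tokens)

def update(model, example, accept):
    # Toy update: remember tokens from accepted examples.
    if accept:
        model.update(example.split())

stream = ["Apple hires new CEO", "Apple hires new CFO", "Rain expected today"]
model = set()     # stands in for the model you load and keep updating
annotations = []

for example in stream:
    if score(model, example) < 0.75:  # only ask about uncertain examples
        accept = "Apple" in example   # stands in for the human decision
        annotations.append((example, accept))
        update(model, example, accept)

# "Apple hires new CFO" is skipped: after the first answer, the model
# is already confident about it, so the annotator never sees it.
print(len(annotations))
```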

I think the problem you're experiencing happens because you start off with a "fresh" model (e.g. the default `en_core_web_sm`) every time you start the Prodigy server. So on each annotation run, you start with a model that hasn't learned anything yet – which is why it keeps asking the same questions. When you start Prodigy and add to an existing dataset, the existing annotations are not used to pre-train the model. Prodigy only creates a unique hash for each input example and each annotated example, to make sure you're not annotating the exact same example twice in the same dataset.
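The hashing idea can be sketched like this – note that Prodigy's actual scheme differs in its details, but the principle is the same: one stable hash over the raw input, and a second hash over the whole task, so the same text can still come back with a different label or span proposal:

```python
import hashlib
import json

# Illustrative sketch of input/task hashing for deduplication
# (not Prodigy's exact implementation).

def input_hash(example):
    # Hash only the raw input text.
    return hashlib.sha1(example["text"].encode("utf8")).hexdigest()

def task_hash(example):
    # Hash the input plus the proposed annotation (label, span, etc.).
    payload = json.dumps(example, sort_keys=True)
    return hashlib.sha1(payload.encode("utf8")).hexdigest()

stream = [
    {"text": "Apple hires new CEO", "label": "ORG"},
    {"text": "Apple hires new CEO", "label": "ORG"},     # exact duplicate
    {"text": "Apple hires new CEO", "label": "PERSON"},  # same text, new task
]

seen = set()
unique = []
for eg in stream:
    h = task_hash(eg)
    if h not in seen:
        seen.add(h)
        unique.append(eg)

# The exact duplicate is dropped; the same text with a different
# label is a new task and is kept.
print(len(unique))
```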

We did consider pre-training in `ner.teach` in an early version of Prodigy, but decided against this feature, because it could easily lead to weird and unexpected results. Instead, `ner.batch-train` lets you create trained artifacts of the model, using whichever configuration you like. So for your use case, the steps could look like this:

  1. Start off with a "fresh" model and collect annotations for your dataset using `ner.teach`.
  2. Run `ner.batch-train` to train a model and ensure that it's learning what it's supposed to (it's actually really nice to have this in-between step – if it turns out your data is not suitable, or something else is wrong, you'll find out immediately and can make adjustments).
  3. Load the previously trained model into `ner.teach`, e.g. `ner.teach my_set /path/to/model`, and annotate more examples for the same dataset. You don't need to package your model as a Python package – the model you load in can also be a path.
  4. Train a model again – starting with the "blank" base model, not your previously trained model! – and see if the result improves. Starting with a blank model that hasn't been trained on your examples is important, because you always want a clean state. (You also don't want to end up evaluating your model on examples it has already seen during training.)
  5. Repeat until you're happy with the results.
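On the command line, the steps above could look roughly like this – the dataset name, file paths and label are placeholders, and the exact flags may differ slightly depending on your Prodigy version, so do check them against the recipe docs:

```shell
# 1) Annotate with a fresh model
prodigy ner.teach my_set en_core_web_sm data.jsonl --label ORG

# 2) Train an artifact from the collected annotations
prodigy ner.batch-train my_set en_core_web_sm --output /tmp/model --n-iter 10

# 3) Annotate more examples for the same dataset, loading the model by path
prodigy ner.teach my_set /tmp/model data.jsonl --label ORG

# 4) Train again -- starting from the base model, not /tmp/model
prodigy ner.batch-train my_set en_core_web_sm --output /tmp/model_v2 --n-iter 10
```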

Using the same dataset to store all annotations related to the same project is definitely a good strategy. It means that every time you train and evaluate your model, it will train and evaluate on the whole set, not just the annotations you collected in the last step. This also helps prevent the "catastrophic forgetting" problem. If you keep updating a previously trained model but only evaluate on the latest collected set, the results may look great on each training run. But you have no way of knowing whether your model still performs well on all examples – or whether it "forgot" everything it previously learned, and now only performs well on the latest set of annotations.
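You can see why the full-set evaluation matters with a deliberately extreme toy example – here the "model" simply memorises whatever batch it was last updated on, a caricature of a network overfitting to recent data:

```python
# Toy illustration of why evaluating only on the latest batch hides
# catastrophic forgetting. Batch contents are made up for the example.

batches = [
    {"Apple": "ORG", "Google": "ORG"},   # round 1 annotations
    {"Paris": "GPE", "Berlin": "GPE"},   # round 2 annotations
]

def accuracy(model, dataset):
    correct = sum(1 for text, label in dataset.items()
                  if model.get(text) == label)
    return correct / len(dataset)

model = {}
for batch in batches:
    model = dict(batch)  # an "update" that forgets everything else

latest = batches[-1]
full = {k: v for batch in batches for k, v in batch.items()}

print(accuracy(model, latest))  # looks perfect on the latest batch
print(accuracy(model, full))    # the full set reveals what was forgotten
```

Evaluating on the whole dataset each time is what exposes the gap between the two numbers.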
