Workflow for extending an NER model with new entity types

I think overall your workflow sounds right. I know there are a lot of steps involved; I hope we can continue streamlining this.

For simplicity of discussion, let’s say we start off with a spaCy model that can predict entities A and B (in the en_core_web_lg model we predict 20 entities, but it’s easier to talk about 2). We’ll call this initial model en_ner_ab. Our goal is to get to a model en_ner_abcd. To do that, we need to annotate text with our new entities, C and D.

Let’s add some more notation, and call a piece of text with automatic annotations for entity types A and B text_autoAB. If the text has “gold” annotations (i.e. annotations that are manually reviewed and assumed correct), we’ll write it text_goldAB. Text might have a mix of automatic and gold annotations for different entity types, which we’d write text_autoAB_goldCD.
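To make the notation concrete, here’s roughly what a single text_autoAB_goldCD record might look like in a JSONL task format. The per-span "source" field is made up for this discussion (it’s not something the format requires); it’s just a convenient way to track which spans came from the model and which from an annotator:

```python
# Hypothetical text_autoAB_goldCD record. Only "text", "spans", "start",
# "end" and "label" are part of the usual task format; "source" is added
# here purely for illustration.
example_task = {
    "text": "Apple hired Jane Smith in 2019.",
    "spans": [
        {"start": 0, "end": 5, "label": "A", "source": "auto"},    # predicted by the model
        {"start": 12, "end": 22, "label": "C", "source": "gold"},  # manually annotated
    ],
}
```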

One strategy for the annotation project might be to manually annotate C and D separately, use the model to predict the pretrained entities, and then correct the results:

  1. text
  2. text_goldC
  3. text_goldD
  4. text_autoAB_goldCD
  5. text_goldABCD

You’d use ner.manual for steps 2 and 3, and then write a script to predict the entities and merge the annotations to do step 4. Then you’d use ner.manual for step 5.
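Here’s a rough sketch of what that merging script for step 4 might look like. It assumes the gold C/D annotations are exported as JSONL records with “text” and “spans” fields; the file names, and the A/B labels standing in for the real pretrained labels, are placeholders:

```python
import json
import spacy

nlp = spacy.load("en_core_web_lg")  # the model that predicts the existing entity types


def overlaps(a, b):
    """True if two span dicts overlap in character offsets."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_annotations(gold_path, out_path, keep_labels=("A", "B")):
    """Turn text_goldCD records into text_autoAB_goldCD records."""
    with open(gold_path, encoding="utf8") as f_in, open(out_path, "w", encoding="utf8") as f_out:
        for line in f_in:
            record = json.loads(line)  # one text_goldCD record
            doc = nlp(record["text"])
            gold_spans = record.get("spans", [])
            auto_spans = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
                if ent.label_ in keep_labels
            ]
            # If a predicted span overlaps a gold span, trust the gold one.
            auto_spans = [a for a in auto_spans if not any(overlaps(a, g) for g in gold_spans)]
            record["spans"] = sorted(gold_spans + auto_spans, key=lambda s: s["start"])
            f_out.write(json.dumps(record) + "\n")


merge_annotations("text_goldCD.jsonl", "text_autoAB_goldCD.jsonl")
```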

Another way to go about things would be to skip straight from 1 to 5, just using ner.manual with lots of entities. The advantage of this approach is you only make one pass over the text, but the disadvantage is you have a lot of entities in your UI. Which strategy works better depends on characteristics of your problem, including:

  • How accurate is the existing model?
  • How many existing entity types do you need to reuse?
  • How many entity types do you need to add?
  • How “dense” are the entity annotations? If there are few entities per text, most of your time goes into reading, so it’s more efficient to do more in a single pass.

To answer your question about catastrophic forgetting, let’s imagine training as a function call train() that takes two arguments: an initial model, and some annotations. The output model will predict the union of the entity types in the initial model and in the data we annotate. The catastrophic forgetting problem may occur if we do something like this: en_ner_abcd = train(en_ner_ab, text_goldCD). To avoid this, we can try: en_ner_abcd = train(en_ner_ab, text_autoAB_goldCD) or en_ner_abcd = train(en_blank, text_autoAB_goldCD). It’s hard to be sure which is better. If the dataset for CD is small, starting from a blank model may discard too much information. On the other hand, training from an initial model is harder to reason about, and may be more sensitive to hyper-parameters. It’s tough to give advice that’s much more useful than “try it out and see which looks better”.
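To make the train() pseudo-function a bit more concrete, here’s a minimal sketch of both options using the spaCy v2 API directly (Prodigy’s batch-train recipes wrap a similar loop). The toy training example and the “blank” switch are my own shorthand for this discussion, not anything built in:

```python
import random
import spacy

# One toy example in spaCy's (text, {"entities": [...]}) training format,
# with labels A and C standing in for real entity types.
TRAIN_DATA = [
    ("Apple hired Jane Smith in 2019.",
     {"entities": [(0, 5, "A"), (12, 22, "C")]}),
]


def train(initial_model, train_data, n_iter=10):
    # initial_model is a pretrained package name, or "blank" for a fresh model
    nlp = spacy.blank("en") if initial_model == "blank" else spacy.load(initial_model)
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    else:
        ner = nlp.get_pipe("ner")
    # Make sure the NER component knows about every label in the data
    for _, annots in train_data:
        for _start, _end, label in annots["entities"]:
            ner.add_label(label)
    other_pipes = [p for p in nlp.pipe_names if p != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only update the NER weights
        optimizer = nlp.begin_training() if initial_model == "blank" else nlp.resume_training()
        for _ in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            for text, annots in train_data:
                nlp.update([text], [annots], sgd=optimizer, drop=0.2, losses=losses)
    return nlp


# en_ner_abcd = train("en_core_web_lg", TRAIN_DATA)  # resume from the pretrained model
# en_ner_abcd = train("blank", TRAIN_DATA)           # start from scratch
```

The text_autoAB_goldCD annotations would be converted into that (text, {"entities": ...}) format first; the only real difference between the two calls is which initial weights you start from.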

The latest version of Prodigy does take advantage of improvements to model resuming in spaCy v2.1, so starting from non-blank models, or workflows like en_ner_abcd = train(en_ner_ab, text_goldCD), might work better than they used to. If feasible, I would usually recommend you do the data merging and get to text_goldABCD via the ner.make-gold recipe, though. The correction step probably isn’t that expensive, and you basically have a choice between spending time improving the data and spending time faffing about with the training process. Faffing about with the training is arguably more fun, but the effort isn’t as reusable as investment in the data.

In summary, here’s what I would try first if you only have a couple of new entity types to annotate:

  1. Create train, test and dev text partitions.
  2. Create a subpartition of the training text, “trial”, to annotate entirely first.
  3. Run one ner.manual task per new entity type, with one annotator per entity type, and all annotations over your trial partition.
  4. Use en_core_web_lg to predict the existing entities over the trial text.
  5. Merge the automatic annotations and the manual annotations, so that the trial text is exhaustively annotated.
  6. Use ner.make-gold to correct the annotations manually. You probably want duplicate annotations for this – so both annotators should do the whole dataset, separately.
  7. Find conflicts and resolve them. One way is to go over the conflicts together; another is to resolve them separately and meet afterwards to see where you still disagree. The second way is more expensive, but sometimes leads to better insights. A sketch for surfacing the conflicts is below.
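For the conflict-finding part of step 7, a small script is usually enough. Here’s a rough sketch, assuming both annotators’ datasets are exported as JSONL with “text” and “spans” fields; the file names are placeholders:

```python
import json


def load_spans(path):
    """Map each text to the set of (start, end, label) spans in the export."""
    spans_by_text = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            record = json.loads(line)
            spans = {(s["start"], s["end"], s["label"]) for s in record.get("spans", [])}
            spans_by_text[record["text"]] = spans
    return spans_by_text


def find_conflicts(path_a, path_b):
    """Print every text where the two annotators' span sets differ."""
    a, b = load_spans(path_a), load_spans(path_b)
    for text in sorted(set(a) & set(b)):
        if a[text] != b[text]:
            print(text)
            print("  only annotator A:", sorted(a[text] - b[text]))
            print("  only annotator B:", sorted(b[text] - a[text]))


find_conflicts("annotator_a.jsonl", "annotator_b.jsonl")
```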