Workflow for extending an NER model with new entity types

Hello,
I’m new to Prodigy, and thank you for making this tool. I have a question about the workflow for training the NER model. There’s myself and one other annotator, and I’m thinking of splitting our source text file 50/50 between us. We’d then annotate the new entities using ner.manual with the pretrained en_core_web_lg model. Afterwards, I plan on using db-merge to combine his dataset with mine and train a new model using ner.batch-train. The reason we need the new model is that in the following step, the model used in ner.train only suggests entities it has already seen. We would then use ner.train on the new model for entities like PERSON, TIME and DATE plus the new entities (to prevent catastrophic forgetting) to automate the annotation process, again using the 50/50 split of our source text file. Finally, we would merge again and call ner.batch-train.

I guess my questions about this process are:

  1. Did I properly account for catastrophic forgetting with regard to my workflow?
  2. Is this task better suited for using a blank model from the beginning?
  3. Does Prodigy take care of possible duplicate annotations between me and the other annotator when we use ner.train or ner.manual, and if so, how?

Please let me know if this workflow makes sense to you, or if you have any suggestions for alternative methods/improvements.
    Thanks

I think overall your workflow sounds right. I know there are a lot of steps involved; I hope we can continue streamlining this.

For simplicity of discussion, let’s say we start off with a spaCy model that can predict entities A and B (the en_core_web_lg model predicts 18 entity types, but it’s easier to talk about 2). We’ll call this initial model en_ner_ab. Our goal is to get to a model en_ner_abcd. To do that, we need to annotate text with our new entities, C and D.

Let’s add some more notation, and call a piece of text with automatic annotations for entity-types A and B text_autoAB. If the text has “gold” annotations (i.e. annotations that are manually reviewed and assumed correct), we’ll write it text_goldAB. Text might have a mix of automatic and gold annotations for different entity types which we’d write text_autoAB_goldCD.
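To make that concrete, here’s roughly what one such partially-annotated example could look like in Prodigy’s task format. The text, offsets and the A/C/D labels are invented purely for illustration — in a real project the labels would be things like PERSON or your new types:

```python
# A hypothetical "text_autoAB_goldCD"-style example: some spans come from
# the model's predictions, others from manual annotation.
example = {
    "text": "Apple hired Jane Doe on Monday.",
    "spans": [
        {"start": 0, "end": 5, "label": "A"},    # predicted automatically
        {"start": 12, "end": 20, "label": "C"},  # annotated manually
        {"start": 24, "end": 30, "label": "D"},  # annotated manually
    ],
}
```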

One strategy for the annotation project might be to manually annotate C and D separately, use the model to predict the pretrained entities, and then correct the results:

  1. text
  2. text_goldC
  3. text_goldD
  4. text_autoAB_goldCD
  5. text_goldABCD

You’d use ner.manual for steps 2 and 3, and then write a script to predict the entities and merge the annotations to do step 4. Then you’d use ner.manual for step 5.
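The merging script for step 4 can be fairly simple. Here’s a rough sketch of the idea, assuming Prodigy v1.x; the dataset names ("gold_c", "gold_d", "merged_abcd") and the labels to reuse are placeholders you’d swap for your own:

```python
import copy

import spacy
from prodigy.components.db import connect

nlp = spacy.load("en_core_web_lg")
db = connect()

# Group the manual spans by input hash, so that C and D annotations
# on the same text end up in a single example
merged = {}
for eg in db.get_dataset("gold_c") + db.get_dataset("gold_d"):
    key = eg["_input_hash"]
    if key not in merged:
        merged[key] = copy.deepcopy(eg)
    else:
        merged[key]["spans"] = merged[key].get("spans", []) + eg.get("spans", [])

# Add the model's predictions for the pretrained labels you want to keep
reuse_labels = {"PERSON", "DATE", "TIME"}  # i.e. your "A" and "B"
for eg in merged.values():
    doc = nlp(eg["text"])
    auto = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
        if ent.label_ in reuse_labels
    ]
    gold = eg.get("spans", [])
    # Naive conflict handling: drop a predicted span if it overlaps a gold one
    keep = [
        s for s in auto
        if not any(s["start"] < g["end"] and g["start"] < s["end"] for g in gold)
    ]
    eg["spans"] = sorted(gold + keep, key=lambda s: s["start"])

db.add_dataset("merged_abcd")
db.add_examples(list(merged.values()), datasets=["merged_abcd"])
```

From there, you could export the merged dataset with db-out and use it as the input for the correction pass in step 5.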

Another way to go about things would be to skip straight from 1 to 5, just using ner.manual with lots of entities. The advantage of this approach is you only make one pass over the text, but the disadvantage is you have a lot of entities in your UI. Which strategy works better depends on characteristics of your problem, including:

  • How accurate is the existing model?
  • How many existing entity types do you need to reuse?
  • How many entity types do you need to add?
  • How “dense” are the entity annotations? If there are few entities per text, it’s more efficient to do more in a single pass.

To answer your question about catastrophic forgetting, let’s imagine training as a function call train() that takes two arguments: an initial model, and some annotations. The output model will have the union of the entity types in the initial model and in the data we annotate. The catastrophic forgetting problem may occur if we do something like this: en_ner_abcd = train(en_ner_ab, text_goldCD). To avoid this, we can try en_ner_abcd = train(en_ner_ab, text_autoAB_goldCD) or en_ner_abcd = train(en_blank, text_autoAB_goldCD). It’s hard to be sure which is better. If the dataset for CD is small, starting from a blank model may discard too much information. On the other hand, training from an initial model is harder to reason about, and may be more sensitive to hyper-parameters. It’s tough to give advice that’s much more useful than “try it out and see which looks better”.
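If you want to experiment with that comparison in code rather than through the recipes, here’s a minimal sketch of what train() could look like with spaCy v2.1’s update loop. This is an illustration, not Prodigy’s internals: the examples are assumed to be (text, {"entities": [(start, end, label), ...]}) pairs, i.e. the merged text_autoAB_goldCD data.

```python
import random

import spacy
from spacy.util import minibatch

def train(nlp, examples, n_iter=10):
    """Update (or create) the NER component on (text, annotations) pairs."""
    fresh = "ner" not in nlp.pipe_names
    if fresh:
        nlp.add_pipe(nlp.create_pipe("ner"))
    ner = nlp.get_pipe("ner")
    for _, annots in examples:
        for _, _, label in annots["entities"]:
            ner.add_label(label)
    # Fresh weights for a blank model, resumed weights for a pretrained one
    optimizer = nlp.begin_training() if fresh else nlp.resume_training()
    other_pipes = [p for p in nlp.pipe_names if p != "ner"]
    with nlp.disable_pipes(*other_pipes):
        for _ in range(n_iter):
            random.shuffle(examples)
            for batch in minibatch(examples, size=8):
                texts, annots = zip(*batch)
                nlp.update(texts, annots, sgd=optimizer, drop=0.2)
    return nlp

# Risky: only C/D annotations, so the model may forget A and B
# en_ner_abcd = train(spacy.load("en_core_web_lg"), text_goldCD)

# Safer: merged annotations that still cover A and B
# en_ner_abcd = train(spacy.load("en_core_web_lg"), text_autoAB_goldCD)

# Alternative: the same merged annotations, but starting from a blank model
# en_ner_abcd = train(spacy.blank("en"), text_autoAB_goldCD)
```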

The latest version of Prodigy does take advantage of the improvements to model resuming in spaCy v2.1. So starting from non-blank models, or workflows like en_ner_abcd = train(en_ner_ab, text_goldCD), might work better than they used to. If feasible, though, I would usually recommend you do the data merging and get to text_goldABCD via the ner.make-gold recipe. The correction step probably isn’t that expensive, and you basically have a choice between spending time improving the data and spending time faffing about with the training process. Faffing about with the training is arguably more fun, but the effort isn’t as reusable as investment in the data.

In summary, here’s what I would try first if you only have a couple of new entities to annotate:

  1. Create train, test and dev text partitions.
  2. Create a subpartition of the training text, “trial”, to annotate entirely first.
  3. Run one ner.manual task per entity type. One annotator per entity type. All annotations over your trial partition.
  4. Use en_core_web_lg to predict the existing entities over the trial text.
  5. Merge the automatic annotations and the manual annotations, so that the trial text is exhaustively annotated.
  6. Use ner.make-gold to correct the annotations manually. You probably want duplicate annotations for this – so both annotators should do the whole dataset, separately.
  7. Find conflicts, resolve them. One way is to go over them together, another is to do them separately and meet afterwards to see where you both disagree. The second way is more expensive, but sometimes leads to better insights.
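For step 7, one way to surface the conflicts is to diff the two annotators’ datasets directly. Here’s a rough sketch, assuming each annotator saved their corrections to a separate Prodigy dataset (the dataset names here are made up):

```python
from prodigy.components.db import connect

db = connect()

def spans_by_text(dataset):
    """Map each text's input hash to its (text, set of span tuples)."""
    out = {}
    for eg in db.get_dataset(dataset):
        spans = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
        out[eg["_input_hash"]] = (eg["text"], spans)
    return out

a = spans_by_text("gold_annotator_a")
b = spans_by_text("gold_annotator_b")

# Print the disagreements for texts both annotators worked on
for key in sorted(set(a) & set(b)):
    text, spans_a = a[key]
    _, spans_b = b[key]
    if spans_a != spans_b:
        print(text)
        print("  only annotator A:", sorted(spans_a - spans_b))
        print("  only annotator B:", sorted(spans_b - spans_a))
```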