How to modify labels/entities in default models (en, en_core_web_lg, etc) and retrain

Still on Christmas holidays, so we're not always online – happy holidays btw! :gift:

There are basically two possible paths here:

  1. Add more entity types to the existing model, e.g. start with a pre-trained model and update it with examples of the new entity types and some examples of the existing entity types (to prevent the model from "forgetting" them). In Prodigy, you could use a recipe like ner.make-gold to correct the model's predictions and add your new entity types manually.
  2. Start off with a blank model / a blank entity recognizer and train it from scratch with examples of all entity types you're interested in. In Prodigy, you could start with ner.manual and label everything from scratch, or use ner.teach or ner.match with patterns that describe the entities, to make it easier to get over the cold start problem and label faster by accepting/rejecting. You could also use ner.make-gold here btw with the labels you want to keep (faster, because the model will highlight them and you only need to correct the entities), add you new labels and then train a new model from scratch.

You might have to try both approaches to find out what works best for your use case. If your new categories overlap with categories the model previously predicted, or if you want to train a lot of new stuff in general, it's often not worth it to mess with the pre-trained models. You might have to change pretty much all the weights to teach it the new definitions, and you might end up with all kinds of confusing side-effects due to the existing weights. So it's often easier to start with a blank model and fresh annotations.

Have you tried the solution suggested in the error message? As the message says, the model needs to be able to split sentences, but it currently doesn't set sentence boundaries (because it has no parser and no other component for sentence boundary detection).

The recipe you're running will split the text into sentences (unless you're running it with --unsegmented), but the model you're loading in can't do this, so spaCy complains. To fix this, you can add the sentencizer, a pipeline component that does simple rule-based sentence segmentation. Just make sure you add it before you save out the model:

sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
1 Like