Updated model in ner.teach



I am new to Prodigy and spaCy and have a few basic questions.

  1. While using ner.teach, I use a standard spaCy model like en_core_web_lg to create the annotation dataset. The documentation states that the model is updated with the annotation tasks I perform. Does this mean that the downloaded en_core_web_lg model itself is updated with the new annotations?

  2. If yes, and I later use en_core_web_lg for an annotation task in a totally different context, would it still contain what it learned in step 1?

  3. If no, what does it mean that the base model is updated with the new annotations?

  4. In ner.batch-train, the command-line interface requires an annotated dataset and a base model, and the recipe is supposed to create a new output model. Does it use the base model's algorithm as well as the annotated dataset to create the new model?

  5. Can I run ner.batch-train without using a base model?

(Ines Montani) #2

The model is loaded and updated in memory, but Prodigy won’t just silently overwrite your base model on disk. Training and saving a model should always happen in a separate, explicit step, e.g. using a recipe like ner.batch-train.

One reason for that is that Prodigy will only perform one simple update when updating a model in the loop with your annotations. If you do a full training run, you’ll be able to update in multiple iterations and use other tricks like shuffling the data, tuning the hyperparameters etc., which usually gives you much better accuracy. The model you train like this will essentially be a better version of the model Prodigy updated in the loop, since it’ll be trained from the same data.
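
The difference between a single in-the-loop update and a full training run can be sketched generically. This is only an illustration, not Prodigy's or spaCy's actual implementation: the names single_pass and full_training_run, the (text, annotations) example format and the update callback are all hypothetical, and the callback stands in for whatever function updates the model's weights.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical example format: (text, annotations)
Example = Tuple[str, dict]
UpdateFn = Callable[[List[Example]], None]


def single_pass(examples: Sequence[Example], update: UpdateFn) -> None:
    """Mimic an in-the-loop update: each example is seen once, in order."""
    for example in examples:
        update([example])


def full_training_run(
    examples: Sequence[Example],
    update: UpdateFn,
    n_iter: int = 10,
    batch_size: int = 2,
    seed: int = 0,
) -> None:
    """Mimic a batch training run: several shuffled passes over the same data."""
    rng = random.Random(seed)
    data = list(examples)  # copy so the caller's sequence is not reordered
    for _ in range(n_iter):
        rng.shuffle(data)  # a different example order in every iteration
        for start in range(0, len(data), batch_size):
            update(data[start:start + batch_size])
```

Both functions receive the same annotations. The second one revisits them n_iter times in a different order each time, and n_iter and batch_size are the kind of hyperparameters that can be tuned in a full run.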

I hope I understand your question correctly! By default, the ner.batch-train command will take a pre-trained spaCy model and will then use spaCy to update the entity recognizer’s weights. Additionally, Prodigy also includes some logic to perform better updates from binary annotations (e.g. if we only have yes/no annotations for single entity types).

You always need a spaCy model to start with, which includes the basic language data, tokenization rules etc – but you can definitely start with an empty model for any of the 30+ languages supported by spaCy. Here’s an example:

import spacy

nlp = spacy.blank('en')  # create a blank English model (language data and tokenizer only)
nlp.begin_training()     # initialize model weights
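
To pass a blank model to a recipe as the base model, it presumably has to be saved to disk first so that it can be loaded by path. The following is a minimal sketch under that assumption. It requires spaCy to be installed, the helper name save_blank_model and the directory name are hypothetical, and it only covers creating, saving and reloading the model.

```python
import tempfile
from pathlib import Path

import spacy


def save_blank_model(lang: str, output_dir: str) -> Path:
    """Create a blank spaCy model for `lang` and write it below `output_dir`."""
    nlp = spacy.blank(lang)  # language data and tokenizer, no trained components
    model_path = Path(output_dir) / f"blank_{lang}_model"  # hypothetical directory name
    nlp.to_disk(model_path)  # writes the model files into this directory
    return model_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = save_blank_model("en", tmp)
        reloaded = spacy.load(path)  # load the saved model back by its path
        print(reloaded.lang, [token.text for token in reloaded("Hello world")])
```

The returned path is what would be given wherever a base model is expected, in place of a package name like en_core_web_lg.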


Thank you for the prompt response, Ines. Things are clearer now.