Updated model in ner.teach

I am new to Prodigy and spaCy and have a few basic questions.

  1. While using ner.teach, I use a standard spaCy model like en_core_web_lg to create the annotation dataset. The documentation states that the model is updated with the annotation tasks I perform. Does this mean that the downloaded en_core_web_lg model is updated with the new annotations?

  2. If yes, and I then use en_core_web_lg for the next annotation task in a totally different context, does that mean it would contain the learnings from step 1?

  3. If no, what does it mean that the base model is updated with the new annotations?

  4. In ner.batch-train, the command-line interface specifies an annotated dataset and a base model, and the recipe is supposed to create a new output model. Does it use the base model's algorithm as well as the annotated dataset to create the new model?

  5. Can I run ner.batch-train without using a base model?


The model is loaded and updated in memory, but Prodigy won’t silently overwrite your base model on disk. Saving out an updated model is always a separate step, e.g. using a recipe like ner.batch-train.

One reason for that is that Prodigy will only perform one simple update when updating a model in the loop with your annotations. If you do a full training run, you’ll be able to update in multiple iterations and use other tricks like shuffling the data, tuning the hyperparameters etc., which usually gives you much better accuracy. The model you train like this will essentially be a better version of the model Prodigy updated in the loop, since it’ll be trained from the same data.
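The gap between a single in-the-loop update and a full training run can be sketched with a toy example (pure Python; the one-weight "model" and data here are invented for illustration and have nothing to do with Prodigy's internals):

```python
import random

# Toy model: a single weight w, fit so that w * x predicts y = 2 * x.
data = [(x, 2 * x) for x in range(10)]

def update(w, batch, lr=0.005):
    """One simple gradient step per example, like the update in the loop."""
    for x, y in batch:
        w -= lr * 2 * (w * x - y) * x
    return w

# In-the-loop style: a single simple pass over the answers.
w_loop = update(0.0, data)

# Batch-train style: many iterations, shuffling the data between them.
w_full = 0.0
for epoch in range(20):
    random.shuffle(data)
    w_full = update(w_full, data)

# w_full ends up much closer to the true weight 2.0 than w_loop does.
```

Real training adds more on top (hyperparameter tuning, held-out evaluation), but the shape is the same: the full run revisits the same annotations many times instead of seeing each batch once.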

I hope I understand your question correctly! By default, the ner.batch-train command will take a pre-trained spaCy model and will then use spaCy to update the entity recognizer’s weights. Additionally, Prodigy also includes some logic to perform better updates from binary annotations (e.g. if we only have yes/no annotations for single entity types).
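The idea behind "better updates from binary annotations" can be illustrated with a toy sketch (the data and helper below are invented for illustration; the actual logic in Prodigy and spaCy is different): a yes/no answer doesn't give you a full gold-standard parse, but it rules candidate analyses in or out, and the model can then be updated toward the analyses that remain consistent.

```python
# Candidate entity analyses for a sentence mentioning "Apple" and "U.K.".
candidates = [
    [("Apple", "ORG")],
    [("Apple", "FRUIT")],
    [("Apple", "ORG"), ("U.K.", "GPE")],
]

# Binary annotations: the annotator accepted one span and rejected another.
answers = {("Apple", "ORG"): True, ("Apple", "FRUIT"): False}

def consistent(analysis, answers):
    """Keep an analysis only if it contradicts no binary answer.
    Spans we have no answer for are treated as possible."""
    return all(answers.get(span, True) for span in analysis)

viable = [a for a in candidates if consistent(a, answers)]
# The model is then updated toward the viable analyses, rather than
# toward a single fully-annotated gold parse.
```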

You always need a spaCy model to start with, which provides the basic language data, tokenization rules etc. – but you can definitely start with a blank model for any of the 30+ languages supported by spaCy. Here’s an example:

import spacy

nlp = spacy.blank('en')  # create a blank English model
nlp.begin_training()     # initialize the model weights

Thank you for the prompt response, Ines. Things are clearer now.


  1. Is the model loaded and updated in memory every time a user hits save, or whenever the batch size defined in the configuration is reached?
  2. Does the annotation queue get updated after every model update, too (since the uncertainty scores will change after new information is added to the model)?

Yes – every time a batch of new answers is sent back to the server, the recipe’s update callback is triggered and the model is updated.

Yes, the stream (annotation queue) is just a Python generator that yields out examples, and Prodigy consumes them in batches. This means it can respond to arbitrary state changes. The examples are processed with the model within the stream generator – so if the model changes, this will be reflected in the next batch that’s sent out for annotation.
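A minimal sketch of that pattern, using a made-up dummy model (the names here are invented for illustration; this is not Prodigy's API):

```python
import random

class DummyModel:
    """Stand-in for the model in the loop: "predicts" random scores
    and just counts how many answers it has been updated with."""
    def __init__(self):
        self.n_updates = 0

    def score(self, text):
        return random.random()

    def update(self, answers):
        self.n_updates += len(answers)

model = DummyModel()

def stream(examples):
    """A stream is a plain generator: each example is scored lazily,
    so a batch pulled after model.update() reflects the new state."""
    for text in examples:
        yield {"text": text, "score": model.score(text)}

texts = ["one", "two", "three", "four"]
gen = stream(texts)

first_batch = [next(gen), next(gen)]   # scored with the initial model
model.update(first_batch)              # answers come back, model changes
second_batch = [next(gen), next(gen)]  # scored only now, post-update
```

Because the generator only scores an example at the moment it's pulled, any change to the model between batches is automatically reflected in the next batch.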

You can see a demonstration of this in this example recipe script. It’s a text classification recipe that uses a dummy model which “predicts” random numbers, to illustrate the idea of updating a model in the loop and scoring new examples with respect to the updated weights.

@ines thanks, it makes sense now.