Model in the loop behaviour

Hi,

I have some questions regarding the "model in the loop":

  • Where is the model stored?
  • Is it possible to train a new model starting with the previously trained model in the loop?
  • I assume whenever a new model is trained the previous model in the loop is discarded, is this correct?
  • Can we filter the data used for training? How could I implement filtering by session_id in the train recipe?

Thanks in advance!

Hi! My answers below assume that this is about the built-in workflows for training with a spaCy model in the loop. Of course, you could also use Prodigy with your own custom models in the loop using custom recipes.

The model in the loop can be any loadable spaCy model and it will be kept in memory after it's loaded. The updated model is temporary and only kept for annotation, and it's discarded when you exit the server.

Yes, but in that case, you typically want to pre-train a model from scratch using the annotations you previously collected, and then use that as the base model when you restart the server. This will give you pretty much the same model you had in the loop, but more stable and accurate, because you can train more effectively, make multiple passes over the data, shuffle it, etc. You can also read more about it here: https://prodi.gy/docs/named-entity-recognition#active-learning-batch-train
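To illustrate what "multiple passes" and "shuffle" mean here, below is a minimal, library-free sketch. It is not how the Prodigy or spaCy training code is implemented; `shuffled_epochs` and the toy `annotations` list are hypothetical names made up for this example. The point is only that batch training can revisit every example several times in a different order, which an in-the-loop model seeing each answer once cannot do.

```python
import random


def shuffled_epochs(examples, n_epochs, seed=0):
    """Yield one freshly shuffled copy of the examples per training pass.

    Hypothetical helper for illustration only, not part of Prodigy or spaCy.
    """
    rng = random.Random(seed)  # fixed seed keeps the runs reproducible
    for _ in range(n_epochs):
        epoch = list(examples)  # copy, so the caller's list is left untouched
        rng.shuffle(epoch)
        yield epoch


# Toy stand-ins for previously collected annotations.
annotations = ["example 1", "example 2", "example 3", "example 4"]

epochs = list(shuffled_epochs(annotations, n_epochs=3))
for i, epoch in enumerate(epochs):
    # A real training loop would update the model on each example here.
    print(f"pass {i}: {epoch}")
```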

The stream generator is a simple Python generator, so there's currently no way for the server to pass back the active session ID when requesting a new batch of examples. So if you need more flexible control over what to send out based on who is annotating, it's probably easiest to start multiple instances of Prodigy on different ports that write to different datasets, and then use the dataset each instance writes to in your stream logic.
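If the goal is only to filter annotations that were already collected before training on them, one option is to export the dataset to JSONL and filter the records in plain Python. The sketch below assumes that each exported record carries the session under a `_session_id` key, so check your own exported data first. The helper names `read_jsonl_lines` and `filter_by_session`, the sample records and the session names are all made up for this example.

```python
import json


def read_jsonl_lines(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_session(examples, session_id, key="_session_id"):
    """Keep only the examples whose session key matches session_id.

    The "_session_id" key is an assumption about the exported records.
    """
    for eg in examples:
        if eg.get(key) == session_id:
            yield eg


# Made-up records standing in for the lines of an exported JSONL file.
exported = [
    '{"text": "Apple opens a new store", "answer": "accept", "_session_id": "ner_data-alice"}',
    '{"text": "I ate an apple", "answer": "reject", "_session_id": "ner_data-bob"}',
    '{"text": "Google hires engineers", "answer": "accept", "_session_id": "ner_data-alice"}',
]

alice_only = list(filter_by_session(read_jsonl_lines(exported), "ner_data-alice"))
for eg in alice_only:
    print(eg["text"], eg["answer"])
```

The filtered records could then be written back out as JSONL and imported into a separate dataset to train from, which avoids having to modify the train recipe itself.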

If you're using a model in the loop, we also typically recommend limiting the number of annotators on the same instance or even giving each annotator their own model instance. Otherwise, the model in the loop may be a lot less effective. In an ideal scenario, all your annotators would be making similar decisions and moving the model in the same direction. But even small differences between annotators can make the model less useful and give you less relevant suggestions.

Thanks a lot for your answer!
However, if the model is kept in memory, does that mean it is not necessary to train the model before using "teach"? Specifically, does "teach" include training the model based on the previous annotations?

No, that's something you should be doing in a previous step (e.g. using train with --binary), which will also give you much more control over the training process. The model you export can then be loaded into ner.teach as the base model.
