Quick newbie training question.
I have created industry-specific word2vec vectors and initialized an empty model. I collected annotations with `ner.manual` and pattern matching to get things going. I trained a new model with those annotations and then used `ner.teach` to annotate another 5000 entities.
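For reference, this is roughly how I built the empty model from the vectors (assuming spaCy v2's `init-model` command; the file paths are placeholders for my actual files):

```
python -m spacy init-model en ./industry_model --vectors-loc industry_w2v.txt.gz
```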
Am I correct in assuming that when `ner.teach` updates the model, it is in effect the same as the `prodigy train` command? I'm thinking that if I now train the model on the new annotations, I would be training it on the same data twice, which sounds like a bad idea. I'm not completely sure about the pathway for updating a model that is already working pretty well.
Also, my empty model initialized from the word2vec vectors does not have a sentencizer. How can I make a blank model from vectors that also has a sentencizer?
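I assume it's something like the sketch below (spaCy v2's API; the paths are placeholders), but I'm not sure this is the right way:

```python
import spacy

# load the blank model that was initialized from the custom vectors
nlp = spacy.load("./industry_model")

# add a rule-based sentencizer so the pipeline can split sentences
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer, first=True)

nlp.to_disk("./industry_model_sents")
```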
Hi! The updating performed in the loop is pretty much the same: in both cases, you're calling `nlp.update` with a batch of examples to update the model. However, when you run `ner.teach`, the model in the loop is discarded afterwards; it doesn't silently overwrite your base model. So you should retrain your model with `train` (and the `--binary` flag to tell Prodigy that the annotations are binary yes/no decisions).
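For example, something along these lines (the dataset name and model paths are placeholders; check `prodigy train --help` in your version for the exact arguments):

```
prodigy train ner ner_teach_dataset ./industry_model --output ./trained_model --binary
```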
This section in the docs explains the logic behind this in more detail:
Why do I need to train again after annotating with a model in the loop?
When you annotate with a model in the loop, the model is also updated in the background. So why do you still need to train your model on the annotations afterwards, and can't just export the model that was updated in the loop? The main reason is that the model in the loop is only updated once for each new annotation. This is never going to be as effective as batch training a model on the whole dataset, making multiple passes over the data, shuffling on each epoch and using other deep learning tricks like dropout rates, compounding batch sizes and so on. If you batch train your model with the collected annotations afterwards, you should get the same model you had in the loop, just better.
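To make the difference concrete, here's a minimal sketch of the kind of batch training loop that `train` runs for you under the hood, assuming spaCy v2's API (the data, paths and hyperparameters are purely illustrative, and it glosses over the binary-annotation handling):

```python
import random
import spacy
from spacy.util import minibatch, compounding

# illustrative gold-standard examples in spaCy v2's training format
train_data = [
    ("Apple is opening a new office in London",
     {"entities": [(0, 5, "ORG"), (33, 39, "GPE")]}),
    # ... more annotated examples
]

nlp = spacy.load("./industry_model")
optimizer = nlp.resume_training()  # keep the existing weights, don't reinitialize

for epoch in range(10):
    random.shuffle(train_data)  # reshuffle on every pass over the data
    for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        # dropout prevents the model from memorizing individual batches
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2)

nlp.to_disk("./batch_trained_model")
```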