I have a general question about Prodigy best practices. I bootstrapped a spaCy model from raw data using the ner.manual recipe, training on top of en_core_web_md to detect a custom entity.
I evaluated my model and found some errors, so I collected more data and now want to update the old model with this new data. What's the best practice here? Should I use ner.correct or ner.manual – is the only difference that one suggests annotations to speed up the process?
Also, before training with prodigy train ner, should I combine the two datasets into one to prevent anything like the catastrophic forgetting problem? Thanks!
Yes, exactly – and it makes it easy to include everything else that the model previously predicted. So if you have a model that predicts something relevant, using ner.correct makes more sense IMO because it saves you time and lets you review the model's predictions at the same time.
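For example – with a hypothetical dataset name, label, and source file – the two recipes are invoked almost identically; the main difference is that ner.correct expects a model that already predicts the label, so it can pre-highlight spans for you:

```shell
# Manual annotation: you highlight every span yourself
# (en_core_web_md is only used here for tokenization)
prodigy ner.manual my_custom_ner en_core_web_md ./new_data.jsonl --label MY_ENTITY

# Semi-automatic: your trained model pre-highlights its predictions
# and you accept or correct them
prodigy ner.correct my_custom_ner ./my_trained_model ./new_data.jsonl --label MY_ENTITY
```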
Yes, if you have all annotations, it makes sense to train from scratch / from the base model every time instead of "incrementally" updating the same artifact. It helps prevent catastrophic forgetting and other side-effects, and also gives you a more stable evaluation. (If you have a dedicated evaluation dataset, you can always run the same evaluation on the same data – but the results are only really comparable if you always start from the same base model.)
The train command lets you specify a comma-separated list of datasets btw and will take care of merging the annotations. So you can run: prodigy train ner dataset_one,dataset_two ...
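The merging idea is roughly: examples that refer to the same input text get combined into a single training example, with duplicate spans collapsed. A simplified sketch of that concept in plain Python (an illustration only, not Prodigy's actual implementation):

```python
# Merge NER annotations from multiple datasets: examples with the same
# input text are combined, and their entity spans are deduplicated.
def merge_datasets(*datasets):
    merged = {}
    for dataset in datasets:
        for example in dataset:
            spans = merged.setdefault(example["text"], set())
            for span in example.get("spans", []):
                spans.add((span["start"], span["end"], span["label"]))
    return [
        {
            "text": text,
            "spans": [
                {"start": s, "end": e, "label": l}
                for s, e, l in sorted(spans)
            ],
        }
        for text, spans in merged.items()
    ]

dataset_one = [
    {"text": "Apple is great", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
]
dataset_two = [
    {"text": "Apple is great", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "I like spaCy", "spans": [{"start": 7, "end": 12, "label": "PRODUCT"}]},
]

merged = merge_datasets(dataset_one, dataset_two)
# Two unique texts; the duplicated ORG span is collapsed into one
```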
So as a follow-up question: when you say it makes sense to train from scratch / from the base model every time instead of "incrementally" updating the same artifact – if my model is based on en_core_web_md, does that mean I would always want to train from that base model?
Your suggestion:
Train en_core_web_md with entity type X
Collect more data, then train en_core_web_md again with entity type X and entity type Y
Vs. my original thought:
Train en_core_web_md with entity type X -> en_core_web_md_with_x
Collect more data, then train en_core_web_md_with_x with entity type X and entity type Y
At what point would I not want to always train from scratch - when it becomes too slow?
That shouldn't be a problem – even if it's slower, it would normally be worth it. If you have access to all the data, it makes sense to train on the whole corpus, from scratch.
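Concretely, that could look something like this (dataset names and output paths are placeholders):

```shell
# First round: train from the base model with the first dataset
prodigy train ner dataset_one en_core_web_md --output ./model_v1

# After collecting more data: retrain from the SAME base model with
# all datasets combined, instead of updating ./model_v1 in place
prodigy train ner dataset_one,dataset_two en_core_web_md --output ./model_v2
```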
Starting with an existing model makes sense if you don't have access to the whole original training data or training process – like with the en_core_web_sm model. We can distribute the model, but not the data, and if you don't have a license for the data, you can't just train the model from scratch yourself. So if you just want to adjust its predictions on your data, it can be easier and more efficient to take the existing model and update it with a few more examples.
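In that scenario, fine-tuning the distributed model directly is the pragmatic option – something along these lines (the dataset name and output path are hypothetical):

```shell
# Update the pretrained pipeline with a small set of corrections,
# since the original training corpus isn't available to retrain from
prodigy train ner my_corrections en_core_web_sm --output ./en_core_web_sm_adjusted
```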