Textcat teach after training to better converge model's decisions

Hi, I hope this message finds you all well!
I've been super happy with the model I've created using Prodigy. My workflow was as follows:

  1. Create empty dataset
  2. Teach patterns from seeds (txt vocabularies)
  3. Export the patterns to jsonl files
  4. Teach separately for each class (4 labels for my project)
  5. Train model using train textcat (I was using batch-train, but since it's deprecated, I ended up using train instead)
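
For anyone following along, the workflow above roughly corresponds to commands like the following. This is just a sketch assuming Prodigy v1.x recipe names; the dataset names, seed/source files and the label are placeholders I made up:

```shell
# 1. Create an empty dataset to hold the annotations
prodigy dataset quake_tweets "Earthquake tweet classification"

# 2. Build up terminology from seed terms, using the GloVe vectors
prodigy terms.teach quake_seeds ./classification/ --seeds seeds_casualties.txt

# 3. Export the collected terms as match patterns (JSONL)
prodigy terms.to-patterns quake_seeds ./patterns_casualties.jsonl --label casualties

# 4. Annotate one label at a time, bootstrapped with the patterns
prodigy textcat.teach quake_tweets ./classification/ ./tweets.jsonl \
    --label casualties --patterns ./patterns_casualties.jsonl

# 5. Train the text classifier from the collected annotations
prodigy train textcat quake_tweets ./classification/ --output ./clf_model
```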

I taught the model about 4,300 annotations in total. In the end, I used this model on a certain dataset to export a CSV, so I could evaluate the predictions for each sentence myself. [My current project is text classification on earthquake-related Twitter data, labelling each tweet as infrastructure, casualties, natural_hazards and feeling_intensity.]
The accuracy reported by train textcat was 94%, and as I scrolled through the exported CSV, the predictions were really good, apart from some edge cases and a few examples that struggled to pass 50% for a given category.
From now on, I'd like to correct my model on some of those edge cases, but every time I try to do that, say by annotating 100 examples for casualties, something else breaks or fluctuates.
My understanding is that I should use textcat.teach with the same model I've already annotated with, and then train it again. Here is my question:

I have exported the annotations as a backup, in case I want to return to this exact version of the model. Let's say I've loaded those annotations into clf_model. In my case, ./classification/ is a GloVe vector model pre-trained on Twitter data.
prodigy train textcat clf_model ./classification/ --output clf_model --batch-size 10 --n-iter 10
After that, I run textcat.teach on clf_model with the same pre-trained vectors, right? Not with the clf_model spaCy model.
Edit: I should probably give a bit more info on the question above. After training, a spaCy model is created. Should the next textcat.teach run use THAT model, or the GloVe model?
And after I teach the model some more annotations, do I train it again with the same output directory, or should I keep the previous one, in case something goes wrong?

Sorry for the lengthy post and thanks a lot in advance!

Thanks, that's nice to hear! :blush:

If you want to improve an existing model (like clf_model that you just trained), you should use that model with textcat.teach. This will show you suggestions made by the updated model, so you don't have to start again from zero.

Since your clf_model is trained based on the GloVe vector model (./classification), the exported directory will include those vectors as well.
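
Concretely, that second annotation session could look something like this. Again just a sketch: the source file, label and the new dataset name are placeholders, and the important part is that the first positional model argument points at the trained output directory, not the vectors:

```shell
# Annotate with the model you just trained, so its updated
# predictions drive the suggestions; save to a NEW dataset
prodigy textcat.teach clf_model_improved ./clf_model ./tweets.jsonl \
    --label casualties
```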

We typically recommend training from scratch with all annotations – there's not really an advantage in updating the same artifact, and it can easily lead to unintended side effects, forgetting effects, etc. That said, it's always a good idea to keep separate datasets for the different experiments, so you can always reconstruct the intermediate model artifacts or start over if you make a mistake.

For example, in your first annotation run, you've saved your annotations to the dataset named clf_model and train from that. When you start textcat.teach again, you can save your annotations to a dataset clf_model_improved and run prodigy train textcat clf_model,clf_model_improved ./classification/ ... to train a new model using all annotations and your base vectors.

Now if you make a mistake in the second annotation run and want to start over, you can just delete the clf_model_improved dataset and start again. Or if you want to re-create the first model you trained, you can just train from clf_model again.
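
Under those assumptions, the start-over and retrain steps would be something like the following (a sketch; the output directory name is a placeholder):

```shell
# Discard the second annotation run and start it again
prodigy drop clf_model_improved

# Or: train a fresh model from both datasets plus the base vectors
prodigy train textcat clf_model,clf_model_improved ./classification/ \
    --output ./clf_model_v2
```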
