Hi, I hope this message finds you all well!
I've been super happy with the model I've created using Prodigy. My workflow was as follows:
- Create empty dataset
- Teach patterns from seeds (txt vocabularies)
- Export the patterns to jsonl files
- Teach separately for each class (4 labels for my project)
- Train the model using train textcat (I was using batch-train, but since it is deprecated I ended up using train instead)
I taught the model about 4300 annotations in total. In the end, I used this model on a certain dataset to export a CSV, so that I could evaluate the predictions for each sentence myself. [My current project is text classification on earthquake-related Twitter data, labelling the entire tweet as infrastructure, casualties, natural_hazards or feeling_intensity.]
The accuracy reported by train textcat was 94%, and as I scrolled through the exported CSV the predictions were really good, apart from some edge cases and some examples that struggled to pass 50% for any given category.
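To make the "struggled to pass 50%" cases concrete, here is a minimal sketch of how such rows could be pulled out of the exported CSV. The column layout (a `text` column plus one score column per label) and the helper name `low_confidence_rows` are assumptions for illustration, not something Prodigy produces by itself, and the two sample rows are made up.

```python
import csv
import io

# The four labels from my project; the CSV column names are assumed to match.
LABELS = ["infrastructure", "casualties", "natural_hazards", "feeling_intensity"]

def low_confidence_rows(rows, labels=LABELS, threshold=0.5):
    """Return (text, best_label, best_score) for rows whose top score is below threshold."""
    flagged = []
    for row in rows:
        scores = {label: float(row[label]) for label in labels}
        best_label = max(scores, key=scores.get)
        if scores[best_label] < threshold:
            flagged.append((row["text"], best_label, scores[best_label]))
    return flagged

# Tiny in-memory stand-in for the exported CSV (made-up rows).
sample_csv = io.StringIO(
    "text,infrastructure,casualties,natural_hazards,feeling_intensity\n"
    "bridge collapsed downtown,0.91,0.12,0.05,0.02\n"
    "did anyone else feel that?,0.08,0.03,0.21,0.46\n"
)
for text, label, score in low_confidence_rows(csv.DictReader(sample_csv)):
    print(f"{score:.2f}  {label:<18} {text}")
```

Rows flagged this way would be natural candidates for the next round of annotation.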
From now on, I'd like to correct my model for some edge-case scenarios, but every time I try to do that (let's say I annotate 100 examples for casualties), something else breaks or fluctuates.
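One way I could pin down what "breaks" is to compare per-label accuracy of the old and new model on the same fixed set of hand-checked examples. The sketch below assumes I already have three parallel lists of label strings (gold labels, old predictions, new predictions); the function names are hypothetical and nothing here calls Prodigy or spaCy.

```python
from collections import defaultdict

def per_label_accuracy(gold, predicted):
    """Accuracy per gold label; gold and predicted are parallel lists of label strings."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}

def regressions(gold, old_pred, new_pred, tolerance=0.0):
    """Labels whose accuracy dropped by more than tolerance, mapped to (old, new) accuracy."""
    old_acc = per_label_accuracy(gold, old_pred)
    new_acc = per_label_accuracy(gold, new_pred)
    return {
        label: (old_acc[label], new_acc[label])
        for label in old_acc
        if new_acc[label] < old_acc[label] - tolerance
    }

# Made-up example: casualties improves after retraining, infrastructure gets worse.
gold = ["casualties", "casualties", "infrastructure", "infrastructure"]
old_pred = ["casualties", "infrastructure", "infrastructure", "infrastructure"]
new_pred = ["casualties", "casualties", "infrastructure", "casualties"]
print(regressions(gold, old_pred, new_pred))
```

Keeping the evaluation examples fixed between runs matters here, otherwise a change in the numbers could come from a different evaluation split rather than from the model.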
My understanding is that I should use textcat.teach with the same model that I already annotated with and then train it again. Below is my question:
I have exported the annotations as a backup, in case I want to return to this exact version of the model. Let's say that I've loaded those annotations into clf_model. './classification/' in my case is a model with pre-trained GloVe vectors trained on Twitter data.
prodigy train textcat clf_model ./classification/ --output clf_model --batch-size 10 --n-iter 10
After that, I run textcat.teach on clf_model with the same pre-trained vectors, right? Not with the clf_model spaCy model.
Edit: The question above probably needs a bit more info. After training, a spaCy model is created. Should the next textcat.teach session be run on THAT model, or on the GloVe model?
And so, after I teach the model some more annotations, do I train it again with the same output, or should I keep the previous one in case something goes wrong?
Sorry for the lengthy post and thanks a lot in advance!