Pattern files for textcat.teach

Thanks for the thorough reply! Hopefully this is the final set of clarifications I need before I can move on without patterns.

I tried to reproduce this and wasn't able to get the pattern match to work, but since the task comes up differently each time, I believe that with enough runs I might end up with the results you saw.
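
For reference, the patterns I was testing followed the JSONL format from the docs, one entry per line with either a token pattern or an exact string; the actual words below are just placeholders:

{"label": "INSULT", "pattern": [{"lower": "idiot"}]}
{"label": "INSULT", "pattern": "complete moron"}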

This makes sense to me, and I've seen it suggested elsewhere on the support boards. Can I confirm that the steps below are what you have in mind?

prodigy dataset insult_bootstrap
grep -i -E 'list|of|insult|words' inputfile.jsonl | prodigy mark insult_bootstrap --label INSULT --view-id classification
prodigy textcat.batch-train insult_bootstrap en_core_web_lg --output insult_bootstrap_model
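
Here inputfile.jsonl is the raw texts in the standard JSONL loader format, so each line that grep matches and pipes into mark would look roughly like this (texts made up):

{"text": "an example comment containing one of the insult words"}
{"text": "another raw comment pulled from the source data"}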

Now I have a model that's pre-trained on the examples containing the insult words. This will be used for a regular training round:

prodigy dataset insult
prodigy textcat.teach insult ./insult_bootstrap_model --label INSULT
prodigy textcat.batch-train insult ./insult_bootstrap_model --output insult_model
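
Between those steps I was also planning to sanity-check what each dataset contains, which I assume is just:

prodigy stats insult_bootstrap
prodigy stats insult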

Does this look right? A few clarifications:

  1. Is the idea to filter down to the cases that are more likely to be positives, annotate them under a dedicated label, and then export a trained model? Or is this really a simpler process that I can do entirely within one dataset?
  2. Should I have imported the insult_bootstrap dataset into the insult dataset, so the actual/final model is also trained on those annotations? (If so, my guess at the commands is sketched after this list.)
  3. Should both textcat.teach and textcat.batch-train use the exported insult_bootstrap_model as their starting model?
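
For question 2, I'm assuming the import would just be an export/import between the two datasets, something like:

prodigy db-out insult_bootstrap > insult_bootstrap_annotations.jsonl
prodigy db-in insult insult_bootstrap_annotations.jsonl

but please correct me if there's a better way.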

Thanks again for your help with this!