textcat.teach model init: db-based or session-only?

Hello there!

For each label I am currently:

  • bootstrapping the model with patterns
  • running textcat.teach with the prefer_high_scores sorter and annotating until the progress bar shows around 90% (usually something over 1000 examples)
  • running textcat.batch-train, which typically achieves around a 75% F-score (a rough sketch of these commands is below)
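
Roughly, the commands look like this (dataset, source and pattern file names are placeholders, and the exact flags may vary by Prodigy version):

```
# placeholder names: my_dataset, news.jsonl, patterns.jsonl, MY_LABEL
# bootstrap with seed patterns and annotate
# (the prefer_high_scores sorter is set inside the recipe itself,
#  so I'm using a copy of the recipe with the sorter swapped)
prodigy textcat.teach my_dataset en_core_web_sm news.jsonl \
    --label MY_LABEL --patterns patterns.jsonl

# batch-train on everything annotated so far
prodigy textcat.batch-train my_dataset en_core_web_sm --output /tmp/textcat_model
```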

At this point, I would like to boost the performance by adding more examples using textcat.teach with the prefer_uncertain sorter. (Hopefully this workflow is sensible, or should I rather be focusing on hyperparameter tuning?) However, when I start textcat.teach again, the progress bar suggests that the model in the loop is only being trained on the current session's annotations.

Is there any way to initialise the model in the loop from all the examples already in the db?

This is correct, it always starts from the base model – because otherwise, we'd essentially have to run textcat.batch-train under the hood before each annotation session. So instead of wrapping that in textcat.teach, you can just run that step yourself with the settings you need, pre-train your model and then use that artifact as the base model.

So when you run textcat.teach for the second time, you can pass in the path to the model you trained with textcat.batch-train instead of the base model (en_core_web_sm etc.).
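
For example, something along these lines (paths and dataset names are placeholders, and the exact flags depend on your Prodigy version):

```
# pre-train on everything already in the dataset and save the artifact
prodigy textcat.batch-train my_dataset en_core_web_sm --output /tmp/textcat_pretrained

# start the next teach session from that artifact instead of the base model
prodigy textcat.teach my_dataset /tmp/textcat_pretrained news.jsonl --label MY_LABEL
```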


Excellent! Thanks for the clarification!

Is there actually any benefit to using prefer_uncertain on textcat.batch-train models trained for each label separately, compared to using prefer_uncertain on a single textcat.batch-train model trained on a merged dataset of all the labelled examples?

My expectation is that the uncertain cases suggested for annotation should be the same, but I might be missing something...