How textcat.teach works under the hood

Hello! I'm a relative novice with Prodigy and I'm trying to work on a multi-label text classification model that has exclusive categories. I want to use textcat.teach to seed a classification model for easier training, and I've been referencing the insult classification video as well as this very helpful and reassuring issue. But I have some lingering questions about how prodigy teach works under the hood:

  • The CLI for prodigy textcat.teach allows us to pass in a spaCy model but does not let us pass in an output directory. I had thought that the teach recipe enhances and retrains the spaCy model we pass it (especially given this part of the recipe documentation: "Updates: spaCy model in the loop"), but the fact that there's no saved model output makes me wonder where this retrained model is stored and how it can be re-used for actual training.
  • When it comes to seeding the model, is that a separate process that comes before the textcat.teach classification? I was given the impression by the issue linked above that we need to seed each category with different patterns for multi-label, is that correct?
  • Relatedly, how does seeding with terms actually seed the model on the back end? Does it retrain the spaCy model using those seeded terms?
  • Which step should textcat.teach fall into? Should I use it to create a first batch of training data, or to update a model that I've trained with textcat.manual? Can I use a local model with textcat.teach instead of spaCy's pretrained models?

Finally, something unique about my dataset is that I'm working with long-form article data with certain fields that I'm attaching to the main body of the text, such as the author bio and the general topic category the article falls into. I wonder if this is something I can leverage to enhance the model further.

Sorry about the novel of questions here but I'd really appreciate folks' help answering some of these!

Welcome to the forum @qu-genesis! :wave:

The CLI for prodigy textcat.teach allows us to pass in a spaCy model but does not let us pass in an output directory. I had thought that the teach recipe enhances and retrains the spaCy model we pass it (especially given this part of the recipe documentation: "Updates: spaCy model in the loop"), but the fact that there's no saved model output makes me wonder where this retrained model is stored and how it can be re-used for actual training.

The objective of the model update in the teach recipes is to provide more and more relevant suggestions for annotation ("more relevant" meaning the ones the model is most uncertain about, so they are the most informative for the model). That also means that the model is updated in very small increments and never sees the entire dataset, so it will always be worse than a model trained properly on the entire dataset. We talk a little bit about this in our docs here under "Why do I need to train again after annotating with a model in the loop?"
In other words, the really valuable output from the teach recipes is the annotated dataset, not the model you used as an aid to select the examples to annotate.
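If it helps to make that concrete, here's a rough sketch (not the actual Prodigy source; the callback and the label handling are simplified assumptions) of what the model-in-the-loop update amounts to in spaCy terms:

# Simplified, hypothetical sketch: each answered batch triggers one small
# incremental update, so the in-loop model never gets a proper pass over
# the full, shuffled dataset.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")               # stand-in for the pipeline you pass in
textcat = nlp.add_pipe("textcat")     # exclusive categories, as in your project
for label in ("SPORTS", "POLITICS"):  # hypothetical labels
    textcat.add_label(label)
nlp.initialize()

def update(answers):
    """Hypothetical update callback, called with each small batch of annotations."""
    examples = []
    for task in answers:
        if task["answer"] == "accept":
            cats = {label: float(label == task["label"]) for label in textcat.labels}
            examples.append(Example.from_dict(nlp.make_doc(task["text"]), {"cats": cats}))
    if examples:
        nlp.update(examples)  # one small incremental step, not a full training run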

When it comes to seeding the model, is that a separate process that comes before the textcat.teach classification? I was given the impression by the issue linked above that we need to seed each category with different patterns for multi-label, is that correct?

The seeding with patterns is only used for selecting the examples to annotate. For each example, the recipe generates the model's predictions and the PatternMatcher's matches and combines the results. The pattern matches do not have a direct effect on the model training; they only have an indirect effect by influencing the choice of the examples to annotate.
As for the model categories and pattern labels: technically, the labels used in the patterns file do not have any effect on the model; only the fact that there's a match matters. Nonetheless, I would recommend using the same categories as in the model, just for clarity on how representative your pattern set is, easier debugging, and the possibility of reusing patterns for certain categories in additional annotation rounds.
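For example (the labels and seed terms below are just hypothetical placeholders, not from your project), a patterns file for a multi-label setup could look like this, with a separate set of seed patterns per label:

# Illustrative only: hypothetical labels and seed terms.
# The label attached to each pattern only influences which examples get
# surfaced for annotation; it does not train the model directly.
import srsly

patterns = [
    {"label": "SPORTS", "pattern": [{"lower": "championship"}]},
    {"label": "SPORTS", "pattern": "world cup"},
    {"label": "POLITICS", "pattern": [{"lower": "parliament"}]},
    {"label": "POLITICS", "pattern": "general election"},
]
srsly.write_jsonl("textcat_patterns.jsonl", patterns)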

Relatedly, how does seeding with terms actually seed the model on the back end? Does it retrain the spaCy model using those seeded terms?

As I mentioned above, the predictions from the model and the PatternMatcher are simply combined; this is mostly useful at the beginning to alleviate the cold-start problem. Our docs on custom textcat models show in detail how the models are combined, but the gist of it is:

# In combine_models:
stream1 = one_predict(iter(batch))  # textcat predictions
stream2 = two_predict(iter(batch))  # PatternMatcher predictions
yield from interleave((stream1, stream2))

The PatternMatcher scores each pattern based on the number of hits, and this number is eventually used in the sampling method of the textcat.teach recipe. You can find more details on how the score for patterns is computed here, under "How does the confidence scoring work, and how can I use the priors?"
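Roughly speaking (this is my reading of the docs rather than the exact implementation, and the prior values below are made up for illustration), each pattern keeps counts of how often its suggestions were accepted or rejected, smoothed by the priors:

# Rough, illustrative sketch of prior-smoothed pattern confidence.
def pattern_score(accepted, rejected, prior_correct=2.0, prior_incorrect=2.0):
    return (accepted + prior_correct) / (
        accepted + rejected + prior_correct + prior_incorrect
    )

print(pattern_score(0, 0))  # 0.5: before any feedback, the priors dominate
print(pattern_score(8, 2))  # ~0.71: the score climbs as a pattern keeps getting accepted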

Which step should textcat.teach fall into? Should I use it to create a first batch of training data, or to update a model that I've trained with textcat.manual? Can I use a local model with textcat.teach instead of spaCy's pretrained models?

It is usually more efficient to do some manual annotations first, so that the model used in textcat.teach does not have to start from scratch and you can get meaningful suggestions from the get-go. So yes, textcat.manual (plus a training run on those annotations) followed by textcat.teach is usually the right workflow. And of course, you can specify your own model for textcat.teach: one way is to provide the path to where you've saved the trained spaCy pipeline as the spacy_model argument.
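As a quick sanity check before starting a teach session (the path and label names below are hypothetical), you can confirm that the locally trained pipeline loads and exposes the textcat labels you expect:

# Illustrative: load a locally trained pipeline before passing its path
# to textcat.teach as the spacy_model argument.
import spacy

nlp = spacy.load("./textcat_model/model-best")  # hypothetical output of your training run
print(nlp.pipe_names)                           # should include "textcat"
print(nlp.get_pipe("textcat").labels)           # the exclusive categories you trained on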

Finally, something unique about my dataset is that I'm working with long-form article data with certain fields that I'm attaching to the main body of the text, such as the author bio and the general topic category the article falls into. I wonder if this is something I can leverage to enhance the model further.

I think attaching the topic should help. The most straightforward thing to do would be to just concatenate it (as you did) and see if the model picks up on it.
You might consider adding some markers within the text around those extra fields so that you can easily strip them away for experimentation.
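For instance (the field names and marker strings here are hypothetical, assuming your source data is JSONL), you could build the tasks so that the extra fields are easy to spot and easy to strip out again:

# Illustrative sketch: concatenate the article body with the topic and author
# bio, separated by simple markers that can be removed for ablation experiments.
import srsly

def make_task(record):
    text = record["body"]
    text += "\n[TOPIC] " + record.get("topic", "")
    text += "\n[AUTHOR_BIO] " + record.get("author_bio", "")
    return {"text": text, "meta": {"topic": record.get("topic", "")}}

tasks = (make_task(rec) for rec in srsly.read_jsonl("articles.jsonl"))
srsly.write_jsonl("articles_with_fields.jsonl", tasks)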
Another option would be to try some feature engineering: create separate embeddings for the article, the bio and the topic, and either use them as separate inputs or concatenate them before feeding them into the model. That would, of course, require implementing a custom textcat model architecture, and I doubt it's worth it, but it's definitely something you could experiment with.
You could also consider using a multi-task learning approach where you train your model to predict not only the text categories but also the topic category. This can help the model learn more robust representations.
Before venturing into more complex architectures, I would definitely try to get a baseline on the articles only, then just concatenate the topic and bio and see if that makes a difference using spaCy's default textcat architecture.