How textcat.teach works under the hood

Hello! I'm a relative novice with Prodigy, working on a multi-label text classification model that has exclusive categories. I want to use textcat.teach to seed a classification model for easier training, and I've been referencing the insult classification video as well as this very helpful and reassuring issue. But I still have some lingering questions about how prodigy teach works under the hood:

  • The CLI for prodigy textcat.teach allows us to pass in a spaCy model but does not let us pass in an output directory. I had thought that the teach recipe enhances and retrains the spaCy model we pass it (especially given this part of the recipe documentation: "Updates: spaCy model in the loop"), but the fact that there's no model-saving output makes me wonder where this retrained model is stored and how it can be re-used for actual training.
  • When it comes to seeding the model, is that a separate process that comes before the textcat.teach classification? I was given the impression by the issue linked above that we need to seed each category with different patterns for multi-label, is that correct?
  • Relatedly, how does seeding the term actually seed the model on the back end? Does it retrain the spaCy model using those seeded terms?
  • Which step should textcat.teach fall into? Should this be when I create a first batch of training data or to update a model that I've trained with textcat.manual? Can I use a local model to do textcat.teach instead of spaCy's pre-set models?

Finally, something unique about my dataset is that I'm working with long-form article data with certain fields that I'm attaching to the main body of the text, such as the author bio and the general topic category the article falls into. I wonder if this is something I can leverage to enhance the model further.

Sorry about the novel of questions here but I'd really appreciate folks' help answering some of these!

Welcome to the forum @qu-genesis! :wave:

The CLI for prodigy textcat.teach allows us to pass in a spaCy model but does not let us pass in an output directory. I had thought that the teach recipe enhances and retrains the spaCy model we pass it (especially given this part of the recipe documentation: "Updates: spaCy model in the loop"), but the fact that there's no model-saving output makes me wonder where this retrained model is stored and how it can be re-used for actual training.

The objective of the model update in the teach recipes is to provide more and more relevant suggestions for annotation ("more relevant" meaning the ones the model is most uncertain about, so they are the most informative for the model). That also means that the model is updated in very small increments and never sees the entire dataset, so it will always be worse than a model trained properly on the entire dataset. We talk a little bit about this in our docs here, under "Why do I need to train again after annotating with a model in the loop?"
In other words, the real valuable output of the teach recipes is the annotated dataset, not the model you used as an aid to select the examples to annotate.
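For example, once a teach session is done, you'd retrain from scratch on the collected annotations with the train recipe - the output directory and dataset name below are placeholders (for non-exclusive labels you'd use --textcat-multilabel instead):

python -m prodigy train ./textcat_model --textcat my_teach_dataset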

When it comes to seeding the model, is that a separate process that comes before the textcat.teach classification? I was given the impression by the issue linked above that we need to seed each category with different patterns for multi-label, is that correct?

The seeding with patterns is only used for selecting the examples to annotate. For each example, the recipe generates the model's predictions and the PatternMatcher's matches, and combines the results. The pattern matches do not have a direct effect on the model training; they only have an indirect effect by influencing the choice of the examples to annotate.
As for the model categories and the pattern labels: technically, the labels used in the patterns file do not have any effect on the model - only the fact that there's a match matters. Nonetheless, I would recommend using the same categories as in the model, just for clarity about how representative your pattern set is, easier debugging, and the possibility of reusing the patterns for certain categories in additional annotation rounds.
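For example, a patterns file for two hypothetical categories could look like this - the label field is only there to keep track of which category each pattern is meant for:

{"label": "FEATURE", "pattern": [{"lower": "profile"}]}
{"label": "EXPLANATORY", "pattern": [{"lower": "explainer"}]}
{"label": "EXPLANATORY", "pattern": "five things you need to know"}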

Relatedly, how does seeding the term actually seed the model on the back end? Does it retrain the spaCy model using those seeded terms?

As I mentioned above, the predictions from the model and the PatternMatcher are just combined - this is mostly useful at the beginning to alleviate the cold-start problem. Our docs on custom textcat models show in detail how the models are combined, but the gist of it is:

# In combine_models:
stream1 = one_predict(iter(batch))  # textcat predictions
stream2 = two_predict(iter(batch))  # PatternMatcher predictions
yield from interleave((stream1, stream2))

The PatternMatcher scores each pattern based on the number of hits, and this number is eventually used in the sampling method of the textcat.teach recipe. You can find more details on how the score for patterns is computed here, under "How does the confidence scoring work, and how can I use the priors?"
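As a rough illustration of the idea (a simplified sketch, not Prodigy's actual implementation): each pattern keeps track of how often its matches were accepted or rejected, and the score is smoothed with prior pseudo-counts so that a pattern with few observations stays close to a neutral prior:

def pattern_score(accepts: int, rejects: int, prior_accept: float = 2.0, prior_reject: float = 2.0) -> float:
    # With few observations the score stays close to the prior mean (0.5 here);
    # as the accept/reject counts grow, it approaches the empirical accept rate.
    return (accepts + prior_accept) / (accepts + rejects + prior_accept + prior_reject)

print(pattern_score(0, 0))  # 0.5 - no evidence yet
print(pattern_score(8, 1))  # ~0.77 - this pattern's matches are mostly accepted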

Which step should textcat.teach fall into? Should this be when I create a first batch of training data or to update a model that I've trained with textcat.manual? Can I use a local model to do textcat.teach instead of spaCy's pre-set models?

It is usually more efficient to do some manual annotations first, so that the model used in textcat.teach does not have to start from scratch and you can obtain meaningful suggestions from the get-go. So yes, textcat.manual followed by textcat.teach is usually the right workflow. And of course, you can specify your custom model for textcat.teach - one way is to provide the path to where you've saved the trained spaCy pipeline as the spacy-model argument.
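For example, with a locally trained pipeline (the paths, dataset name and labels below are just placeholders for your own):

python -m prodigy textcat.teach my_dataset ./output/model-best ./articles.jsonl --label LABEL_A,LABEL_B --patterns ./patterns.jsonl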

Finally, something unique about my dataset is that I'm working with long-form article data with certain fields that I'm attaching to the main body of the text, such as the author bio and the general topic category the article falls into. I wonder if this is something I can leverage to enhance the model further.

I think attaching the topic should help. The most straightforward thing to do would be to just concatenate it (as you did) and see if the model picks up on it.
You might consider providing some markers within the text so that you can easily strip the extra fields away for experimentation.
Another option would be to try some feature engineering: create separate embeddings for the article, bio and topic and use them as separate inputs, or concatenate them before feeding them into the model. That would of course require implementing a custom textcat model architecture, and I doubt it's worth it, but it's definitely something you could experiment with.
You could also consider using a multi-task learning approach where you train your model to predict not only the text categories but also the topic category. This can help the model learn more robust representations.
Before venturing into more complex architectures, I would definitely try to get a baseline on the articles only, and then just concatenate the topic and bio and see if that makes a difference using spaCy's default textcat architecture.
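Picking up on the marker idea, here is a minimal sketch of how you could prepare the input JSONL. The field names (body, topic, author_bio), marker strings and file names are assumptions about your data, so adjust them to whatever your records actually contain:

import json

def build_text(record: dict, include_meta: bool = True) -> str:
    # Concatenate the article body with the extra fields behind simple
    # markers so they are easy to strip out again for experiments.
    parts = [record["body"]]
    if include_meta:
        parts.append("[TOPIC] " + record.get("topic", ""))
        parts.append("[BIO] " + record.get("author_bio", ""))
    return "\n\n".join(parts)

with open("articles_raw.jsonl") as f_in, open("articles.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        task = {"text": build_text(record), "meta": {"topic": record.get("topic", "")}}
        f_out.write(json.dumps(task) + "\n")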


Hi @magdaaniol! Thanks a ton for this incredibly detailed explanation - it makes a lot more sense now!

To kind of summarize things back to you: terms.teach and subsequently textcat.teach are used to sample more relevant examples from a really large dataset to build up a manual dataset for the model to learn from, correct? Wherein the purpose of terms.teach is to help identify the low-hanging fruit, where a pattern match helps the model make a correct prediction. And the purpose of textcat.teach is to identify those examples that give the model low confidence scores and allow the user (us) to correct them?

And would you have any recommendations on the model to use for prodigy train? I've been using en_core_web_trf because I had assumed the transformer to be the most powerful architecture, but I would love to hear your recommendations here. The tricky part is that my samples typically have pretty long texts, so it is very slow to train/predict.

Thank you so much and grateful for your expertise and for this forum!

Hi @qu-genesis ,

Glad to hear you're finding our forum helpful :slight_smile:

terms.teach and subsequently textcat.teach are used to sample more relevant examples from a really large dataset to build up a manual dataset for the model to learn from, correct?

That's correct, yes.

Wherein the purpose of terms.teach is to help identify the low-hanging fruit, where a pattern match helps the model make a correct prediction. And the purpose of textcat.teach is to identify those examples that give the model low confidence scores and allow the user (us) to correct them?

I'd say the purpose of terms.teach is to build a terminology list from the model's static vector table, which you can then use to build patterns. And the patterns do not really help the model make the prediction. As mentioned in my previous post (sorry if it was not clear enough), in the textcat.teach recipe the model and the PatternMatcher provide their predictions independently, and it is the recipe logic that combines them to obtain the final confidence score that determines whether the example is shown to the user or not. So patterns influence the model's decisions only indirectly, as they influence the choice of training examples.

And would you have any recommendations on the model to use for prodigy train?

Assuming you have created your dataset with textcat.teach, which works better with a CNN model (annotating with a transformer model in the loop is hard because transformers don't respond very well to small updates), I would definitely train a model using the same architecture that was used during data collection. After all, the training data was selected to boost the confidence of this particular model. You probably also want to include static vectors - they tend to improve the accuracy.
Then it's of course worth running training experiments with a transformer-based model, but as you rightly point out, the decision whether to use one is a tradeoff between accuracy and training/deployment costs.
For this reason, it's good to know how well your CNN model performs, so that you can evaluate whether it's worth investing in developing and deploying a transformer-based pipeline.

Thanks so much - that answered most of my questions! A couple of follow-ups on the last paragraph you sent.

Assuming you have created your dataset with textcat.teach, which works better with a CNN model (annotating with a transformer model in the loop is hard because transformers don't respond very well to small updates), I would definitely train a model using the same architecture that was used during data collection.

I'm looking at the pretrained spaCy pipelines available and I'm not sure I see a CNN option. Forgive me if this is an amateur question but should I find it here instead?

After all, the training data was selected to boost the confidence of this particular model. You probably also want to include static vectors - they tend to improve the accuracy.

How would I go about including this static vector? Is this an option on the command line or something I have to bake into a custom recipe?

Hi @qu-genesis!

I'm looking at the pretrained spaCy pipelines available and I'm not sure I see a CNN option. Forgive me if this is an amateur question but should I find it here instead?

Not to worry - I was being a bit cryptic in my answer, looking back at it now. Out of the spaCy pretrained pipelines, anything that is not a transformer (i.e. anything without _trf in the name) is a CNN-based model. So to take English as an example, en_core_web_md and en_core_web_lg are the pipelines you would typically use.
In your case, I don't think there's much benefit in using a spaCy pretrained pipeline, though. First, because they don't come with a trained textcat component (the textcat task is usually very domain-specific, so there's not much sense in having a general textcat component), and second, even if there was one, it's unlikely that its categories would overlap with your target categories. That's the reason it is recommended to prepare some manual annotations with textcat.manual, train a (CNN) pipeline and use this custom pipeline for textcat.teach.
To help you with the training config, you can use the spaCy quickstart widget. If you choose CPU for hardware, it will generate a config for building a CNN network; conversely, by choosing GPU, you'll get a transformer network.
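If you prefer the command line over the widget, the roughly equivalent commands would be (the config file names are just placeholders):

python -m spacy init config config_cnn.cfg --lang en --pipeline textcat --optimize accuracy
python -m spacy init config config_trf.cfg --lang en --pipeline textcat --gpu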

How would I go about including this static vector? Is this an option on the command line or something I have to bake into a custom recipe?

That is something you specify in the training config. Here you can find the relevant spaCy docs for more context, but if in the quickstart widget you set "optimize for accuracy", you'll see that it adds the following to the config:

[paths]
vectors = "en_core_web_lg"

This is to instruct the model to source the vectors from en_core_web_lg.
To make sure the textcat component uses the static vectors as features, the widget will set include_static_vectors to true in the textcat tok2vec layer definition:

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

Hey @magdaaniol, no worries at all and this additional clarification is incredibly helpful. I think I have a much better understanding of textcat.teach and pattern matching now. I'll go about starting this workflow and will definitely reach out if I need clarification.

I'd love for you to take a second look at my proposed workflow though and see if you have any suggestions!

Workflow

  • terms.teach to identify a vocabulary list
  • textcat.manual → train to build up a base model
  • textcat.teach to filter and add to dataset with samples the model is most uncertain about
  • textcat.correct or textcat.manual for newer more important samples
  • retrain model with spaCy

Also, I had this idea of using an LLM as a first pass to go through each record and assign its classification as best it can, and then train the first base model using that rough LLM classifier. Is this recommended?

Hi @qu-genesis,

Sure thing! Your plan is missing the training step after the textcat.teach session. Remember, it's important to retrain with multiple passes over the entire dataset to get the most out of it, so:

Workflow

  • terms.teach to identify a vocabulary list
  • textcat.manual -> train to build up a base model
  • textcat.teach to add to the dataset the examples the model is most uncertain about
  • train -> to get the fully trained model
  • textcat.correct / textcat.manual to improve the model and/or add new examples
  • retrain

As for the LLMs, how effective this is for collecting the data depends of course on the final prompt/LLM you would use. It's definitely worth checking out, especially since it is so easy to set up in Prodigy. I would recommend curating the LLM labels before training the initial model. So the version of the workflow with LLMs would be:

  • textcat.llm.fetch to collect LLM annotations
  • textcat.manual to curate LLM annotations
    (you can also merge these two steps by using textcat.llm.correct, but you'd be making calls to the LLM API while annotating, which may sometimes result in longer loading times)
  • train to build up a base model
  • textcat.teach
  • train the full model
  • textcat.correct if necessary

This is absolutely beautiful thank you @magdaaniol!


Hi @magdaaniol, back with another question about textcat.llm.correct.

I'm currently running the GPT-3.5 model and the results are not very satisfactory, which is fair since I'm asking the model to capture some deeper story-format and abstract information.

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v2"
config = {"temperature": 0.3}

However, when I want to switch to spacy.GPT-4.v1 as defined in the spaCy documentation, I'm getting an error that Prodigy isn't able to find this model.

[components.llm.model]
@llm_models = "spacy.GPT-4.v1"
name = "gpt-4"
config = {"temperature": 0.0}

llm-annotate-1 | ValueError: The specified model 'gpt-4' is not available. Choices are: ['babbage-002', 'computer-use-preview', 'computer-use-preview-2025-03-11', 'dall-e-2', 'dall-e-3', 'davinci-002', 'gpt-3.5-turbo', 'gpt-3.5-turbo-0125', 'gpt-3.5-turbo-1106', 'gpt-3.5-turbo-16k', 'gpt-3.5-turbo-instruct', 'gpt-3.5-turbo-instruct-0914', 'gpt-4.5-preview', 'gpt-4.5-preview-2025-02-27', 'gpt-4o', 'gpt-4o-2024-05-13', 'gpt-4o-2024-08-06', 'gpt-4o-2024-11-20', 'gpt-4o-audio-preview', 'gpt-4o-audio-preview-2024-10-01', 'gpt-4o-mini', 'gpt-4o-mini-2024-07-18', 'gpt-4o-mini-audio-preview', 'gpt-4o-mini-audio-preview-2024-12-17', 'gpt-4o-mini-search-preview', 'gpt-4o-mini-search-preview-2025-03-11', 'gpt-4o-search-preview', 'gpt-4o-search-preview-2025-03-11', 'o1', 'o1-2024-12-17', 'o1-mini', 'o1-mini-2024-09-12', 'o1-preview', 'o1-preview-2024-09-12', 'o3-mini', 'o3-mini-2025-01-31', 'omni-moderation-2024-09-26', 'omni-moderation-latest', 'text-embedding-3-large', 'text-embedding-3-small', 'text-embedding-ada-002', 'tts-1', 'tts-1-1106', 'tts-1-hd', 'tts-1-hd-1106', 'whisper-1']

Any advice on how to improve my model performance in this first pass?

Hi @qu-genesis,

It's odd that you're getting this error because the config snippet you provided is definitely valid and it works as expected on my end.
Could you share your spaCy version, spacy-llm version and Prodigy version (e.g. by running python -m pip list | grep -E "spacy|spacy-llm|prodigy")?
Could you also share the entire config file?
Thanks!

As for recommendations to improve the performance of the LLM, it's hard to say anything without a deeper understanding of the use case, but it usually boils down to prompt engineering. You can experiment with providing label definitions and examples, for instance. It might also be that you'd need to write an entirely custom prompt, which requires implementing a custom spacy-llm task. This post shows an example of a custom task in a Prodigy recipe. Not sure if you've seen it, but Prodigy has some recipes (for example ab.llm.tournament) that can help with selecting the best prompt for your use case by measuring preference in a systematic and structured way (rather than relying purely on impressions), which might be worth trying out.
Finally, since you're using OpenAI, it's probably worth reviewing their guide to effective prompt writing.

Hi hi! I ended up figuring that error out - I just needed to discard the double quotes from the name config. So this ended up working for me:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]
[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["INVERTED_PYRAMID", "INVESTIGATIVE_REPORTING", "EXPLANATORY", "FEATURE"]
exclusive_classes = false

[components.llm.task.label_definitions]
INVERTED_PYRAMID = "A short to medium-length article providing facts and new information to readers, typically presenting the most important information first. You can identify it by the fact that it reports on a specific current event, has a straightforward lede without opinion or analysis, and written in objective style."
INVESTIGATIVE_REPORTING = "Involves in-depth research to uncover facts that may be hidden or complex, often requiring extensive source work and data analysis. Note that this should be Star Tribune investigations and not reporting on federal/state investigations. It should contain data and extensive documentation as well as revealing systemic issues. A signal of an investigation story is: “... found that…” or “an analysis revealed…”."
FEATURE = "Longer, detailed article that zeroes in on a subject or a person and tells their backstory, offering historical or personal context. Focuses on the experiences of individuals affected by events or acting in interesting or extraordinary ways, creating emotional connection with readers. The style should be more literary as well."
EXPLANATORY = "A story that explains a complex issue or provides context, helping readers understand complicated subjects or situations. Often in the form of 'Five things you need to know about ...' or 'Your questions answered' or 'How X affects you' or 'Your guide to ...'."

# Add the examples
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "/app/config/few-shot-examples-news.yml"

[components.llm.model]
@llm_models = "spacy.GPT-4.v3"
name = gpt-4o-mini
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "news-article-cache"
batch_size = 5
max_batches_in_mem = 4

And here is my version as well!

prodigy 1.18.0
spacy 3.7.2
spacy-legacy 3.0.12
spacy-llm 0.7.3
spacy-loggers 1.0.5

I do have another question now that I've trained the model and gotten to the textcat.teach part of my workflow. I'm seeing a lot of errors that the model is making as I start textcat.teach (which makes sense, because the model is prioritizing answers it is most uncertain about). But I only have the binary option to accept and reject. In this case, does accepting mean that we are going to put this example in the new training set? If so, would we have to go in after textcat.teach and assign the correct label to it?

Thanks so much for clarifying this for me!

Great! Thanks for sharing the cause of the problem!

In this case, does accepting mean that we are going to put this example in the new training set? If so, would we have to go in after textcat.teach and assign the correct label to it?

When you hit accept, you accept the example with that label, so there's no need to relabel it. In the case of multilabel text classification, the model gets updated with the information for this particular label only; it doesn't update anything about the remaining labels. The remaining labels will appear as separate binary decision tasks for this example.

Also, I recommend this post, which explains a bit more about the effects of binary annotations on the dataset, as well as how you can modify the selection of examples by choosing a different sorter (please note that this is an older post and it references the deprecated batch-train recipe, but the general principle it describes is still valid). prefer_uncertain is an optimal choice for most active learning scenarios, but if you have a very high number of labels in a multilabel scenario and find yourself clicking through mostly negative examples, you might be better off collecting more positive examples at the beginning and switching to uncertainty sampling once you have a stronger baseline model.

If you'd like to apply a different sorter, you can modify the source code of the textcat.teach recipe available in your Prodigy installation path and change line 100 to:

stream.apply(lambda d: prefer_high_scores(predict(d)))

and also the import on line 22:

from ..components.sorters import prefer_uncertain, prefer_high_scores

This makes so much sense - thank you for sending that post!

A couple of questions that I have as follow ups to that post:

  • One quirk I've noticed as I go through my textcat.teach sample is that it seems to show me only a smaller subset of the full data in my source. I've only gone through 50 examples so far but the progress bar shows 17% already, while I know my full dataset contains at least 3,000 examples. Ostensibly the number of samples here should be even higher because we are only annotating binary model predictions. Is this because it's only selecting the low confidence score samples? Here's what I'm running in my Docker container:
python -m prodigy textcat.teach teach_data /app/models/first-pass/model-best /app/data/data.jsonl --label "INVERTED_PYRAMID","INVESTIGATIVE_REPORTING","EXPLANATORY","FEATURE"
  • The other thing I've noticed is that the recipe picked up some samples where the score is 0.01 or 1.00. I had thought that it prioritizes the examples with low confidence scores - is there another reason these examples are included?
  • Additionally, I'm wondering if the best practice would be to eventually train the model on the full set of datapoints that I've annotated: through textcat.llm.correct earlier, through textcat.teach now, and through textcat.correct later. This would involve, I'm assuming, combining the Prodigy datasets we labeled in each step. Is there a recipe that helps us do this?
  • The complication I also foresee is that I'm accepting/rejecting some of the same data I used to train this model. If I combine those datasets, would that cause problems for the model? Would I need a custom script to reconcile these duplicated samples?

Thanks a ton @magdaaniol!

Hi @qu-genesis,

Re progress bar and smaller dataset subset
The progress bar in textcat.teach is proportional to the decrease in the loss function. In other words, it shows how many more examples are needed before the loss hits zero. Naturally, you'd expect to annotate only a subset of the dataset, as you are filtering for the most relevant examples.

In a multilabel scenario, there is a risk that the model gets overly confident due to a potentially skewed exposure to a subset of labels. Since each example is evaluated separately for each label and you're only updating one label at a time, this might lead to the more aggressive filtering effect that you're observing. The uncertainty dynamics also become more complex with multiple labels: many examples might be highly certain for most labels and only uncertain for one or two, causing the sampling algorithm to identify a smaller pool of truly uncertain examples across all labels. This explains the heavy filtering you're experiencing.

This also brings me to one feature of textcat.teach which I probably should have pointed out before (I just forgot your labels are exclusive - sorry!): it assumes that the annotations are non-exclusive. This ability to learn from partial information is useful in a non-exclusive scenario, but it can be inefficient in the exclusive scenario, where accepting one label also tells you something about all the other labels, so you could in principle be updating all labels with each annotation, which would lead to more effective filtering.

For this reason, it would be recommended to run textcat.teach for each label separately. After all, what's uncertain for one label is not necessarily uncertain for the other labels. The active learning procedure should be more effective, the model should converge much faster, and you'd be sure to collect relevant examples for all of your labels.
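For example, reusing the paths from your command above, that could look something like this, with one session and one dataset per label:

python -m prodigy textcat.teach teach_feature /app/models/first-pass/model-best /app/data/data.jsonl --label FEATURE
python -m prodigy textcat.teach teach_explanatory /app/models/first-pass/model-best /app/data/data.jsonl --label EXPLANATORY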

Re High/Low Confidence Examples
I can't say for sure what exactly is responsible for the appearance of high/low confidence examples, but I suspect it might be due to the initial exploration phase of the exponential moving average algorithm, which takes a batch of initial examples (by default, the first 64) and shows a portion of them regardless of uncertainty to establish a baseline:

prebatch = list(take(self.first_n, self.stream))  # collect the first first_n scored examples
prebatch.sort(reverse=True, key=lambda x: x[0])   # sort by priority score, highest first
for priority, task in prebatch[: self.first_n // 10]:  # show the top 10% regardless of uncertainty
    yield task

Another reason could be the dynamic patience threshold, which is adjusted based on how many examples the sorter has recently chosen to show or skip. If it's been showing many examples (increasing patience), the threshold becomes higher, meaning it becomes more selective. If it's been skipping many examples (decreasing patience), the threshold becomes lower, meaning it becomes less selective. The net effect can indeed be the occasional high-confidence (low-uncertainty) example, as the algorithm self-regulates to ensure a reasonable flow of examples by adapting to the distribution of uncertainty in your dataset.

Re combining datasets
Both the data-to-spacy and train recipes take care of merging the annotations from different datasets. For training the spaCy textcat component (which is the one you should use for exclusive labels), the annotations should consistently be exclusive across the datasets, as this will determine which fields of the Prodigy task are taken into consideration while creating the final training example.
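For example, assuming dataset names like these (teach_data is the one from your command; the other names are placeholders), you can pass all of them at once to either recipe:

python -m prodigy train ./full_model --textcat llm_curated,teach_data,correct_data
python -m prodigy data-to-spacy ./corpus --textcat llm_curated,teach_data,correct_data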