textcat.teach doesn't seem to work for updating a text classification model with exclusive classes.

Hello!
I'm trying to create a model that classifies news article headlines by polarity and then update it with textcat.teach. There are three classes: positive, neutral, and negative. They are mutually exclusive, and each sample belongs to exactly one class.
When I updated the model, the score got worse.
What is the cause of this? Is this task unsuitable for textcat.teach, or is my procedure wrong?
Below are my steps.

  1. Create 1000 annotations
prodigy textcat.manual [dataset_1] [news_title1000.jsonl] --label positive,neutral,negative --exclusive
  2. Train a model on the created dataset
prodigy train textcat [dataset_1] ja_core_news_lg --output [model1000_path] --eval-split 0.1 --n-iter 30 --batch-size 64 --textcat-exclusive
  3. Annotate 4000 more examples with active learning in textcat.teach
prodigy textcat.teach [dataset_3] [model1000_path] [new_news_title4000.jsonl] --label positive,neutral,negative
  4. Update model1000 with the dataset created in step 3
prodigy train textcat [dataset_3] [model1000_path] --output [ALmodel_path] --eval-split 0.1 --n-iter 30 --batch-size 64 --textcat-exclusive

Results:

[Model before update]
Label F-Score
-------- -------
positive 56.522
neutral 61.728
negative 76.712

Best F-Score 64.987
Baseline 19.718

[Updated model]
Label F-Score
-------- -------
positive 71.984
neutral 50.526
negative 0.000

Best F-Score 40.837
Baseline 36.644

Hi! The textcat.teach recipe will only suggest examples for you to annotate, so as long as the annotations you collect here are consistent, it doesn't matter whether they were created with textcat.teach or some other process – it's all about how you train from them.

Instead of updating the model artifact multiple times, try training from the blank base model using all annotations you've collected. If you train multiple times with different datasets, it's much harder to reason about the results and you may have to deal with "forgetting effects" etc.
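For example, with the dataset and path placeholders from your post, retraining from scratch on everything would look roughly like this ([combined_model_path] is just a placeholder for a new output directory):

prodigy train textcat [dataset_1,dataset_3] ja_core_news_lg --output [combined_model_path] --n-iter 30 --batch-size 64 --textcat-exclusive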

It also looks like you're not using a dedicated evaluation set and are just holding back 10% of the data, which seems like very little. This means that you can't really compare the accuracy between training runs – what ends up in those 10% will be super different each time because the data is different, and potentially not representative. That makes it very hard to know if your model is improving.
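If you can, it's usually worth creating a small dedicated evaluation set once and reusing it for every run – for example by annotating some held-out headlines with textcat.manual into a separate dataset and passing it to train via --eval-id (available in recent Prodigy v1.x versions; check prodigy train --help for yours). With placeholder names, roughly:

prodigy textcat.manual [eval_dataset] [held_out_titles.jsonl] --label positive,neutral,negative --exclusive
prodigy train textcat [dataset_1,dataset_3] ja_core_news_lg --output [model_path] --eval-id [eval_dataset] --n-iter 30 --batch-size 64 --textcat-exclusive

That way the evaluation examples stay the same across experiments and the scores actually become comparable.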

Thank you for your reply.

The textcat.teach recipe will only suggest examples for you to annotate,

I had totally misunderstood this! Thank you for explaining.

Instead of updating the model artifact multiple times, try training from the blank base model using all annotations you've collected.

Following this advice, I changed step 4 as follows: I trained a blank base model using the annotation dataset from step 1 together with the annotation dataset from step 3, and I increased the eval split from 10% to 20%.

prodigy train textcat [dataset_1,dataset_3] ja_core_news_lg --output [model_path] --eval-split 0.2 --n-iter 30 --batch-size 64 --textcat-exclusive

But the Best F-Score was only 23.222.

The model trained on just the 1000 annotations made with textcat.manual had a Best F-Score of 64.987, so why does it get so much worse when those are combined with the annotations made with textcat.teach?

I wonder if there's a weird/unintended interaction here because of the different data types you've collected and the mix of complete and incomplete annotations: the textcat.manual annotations contain all labels and a definitive answer, whereas the textcat.teach annotations are binary yes/no answers and may not include the final answer (e.g. you may only know that label X doesn't apply to a given text). Prodigy should be able to handle both types, but maybe something is going wrong somewhere :thinking:

Could you try running prodigy data-to-spacy with your two datasets, training with spacy train directly and checking the results you get there?
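The exact arguments depend on your Prodigy and spaCy versions, so please double-check prodigy data-to-spacy --help – but with Prodigy v1.10 and spaCy v2 it would look roughly like this (the output paths are just placeholders):

prodigy data-to-spacy ./train.json ./eval.json --lang ja --textcat [dataset_1,dataset_3] --textcat-exclusive --eval-split 0.2
python -m spacy train ja ./spacy_model ./train.json ./eval.json --pipeline textcat --n-iter 30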

Thanks for your advice! :smiley:
You're right that the textcat.teach annotations don't have a definitive answer. Unfortunately, I'm new to spaCy and spacy train seemed difficult to try, so I tried two things that seemed easier:

  1. I removed the "answer": "reject" samples.
    These only exist in the textcat.teach annotations; removing them reduced the textcat.teach annotations from 4000 to 1600.
  2. I made the two data formats the same.
    The label and answer parts of the two formats differ as follows:
textcat.manual
"options": [{"id": "positive", "text": "positive"}, {"id": "neutral", "text": "neutral"}, {"id": "negative", "text": "negative"}], "accept": ["neutral"], "answer": "accept"
textcat.teach
"label": "positive", "answer": "reject"

I unified everything into the textcat.teach format, roughly as in the sketch below. (This wasn't part of the advice, though, so it may not have affected the outcome.)
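Roughly, the idea of the conversion was something like this (a simplified sketch; the file names are placeholders for the prodigy db-out exports):

import json

def read_jsonl(path):
    # Read annotations exported with "prodigy db-out <dataset>".
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

manual = read_jsonl("manual_annotations.jsonl")  # textcat.manual export (placeholder)
teach = read_jsonl("teach_annotations.jsonl")    # textcat.teach export (placeholder)

combined = []

# 1. Keep only the accepted binary answers from textcat.teach.
for eg in teach:
    if eg.get("answer") == "accept":
        combined.append({"text": eg["text"], "label": eg["label"], "answer": "accept"})

# 2. Convert the textcat.manual examples to the same single-label format.
for eg in manual:
    if eg.get("answer") == "accept" and eg.get("accept"):
        combined.append({"text": eg["text"], "label": eg["accept"][0], "answer": "accept"})

with open("combined_annotations.jsonl", "w", encoding="utf8") as f:
    for eg in combined:
        f.write(json.dumps(eg, ensure_ascii=False) + "\n")

The combined file can then be loaded into a fresh dataset with prodigy db-in and used for training.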

Then I got good results.

[Model made from this dataset:
1000 textcat.manual annotations + 1600 textcat.teach annotations ("answer": "accept" only)]

Label F-Score
-------- -------
negative 78.498
positive 75.410
neutral 63.372


Best F-Score 72.427
Baseline 15.041

This improves on the Best F-Score of 64.987 for the model trained on the 1000 textcat.manual annotations alone.

Is the method I tried a valid step for active learning in this text classification task?

The active learning mostly happens during annotation and helps with the example selection – it should obviously also have an impact on training because the selected examples are better, but that's a bit more indirect. The "magic" happens when you annotate.

If you want to experiment with converting your annotations, I would recommend doing it the other way around and creating one annotation per example with multiple options.
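Just to illustrate what I mean – this isn't a built-in Prodigy command, only a sketch that assumes the fields shown in your examples above and a placeholder file name – converting the accepted textcat.teach answers into the multiple-choice format could look roughly like this:

import json
from collections import defaultdict

LABELS = ["positive", "neutral", "negative"]

def read_jsonl(path):
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Group the accepted binary answers by text, then emit one
# multiple-choice example per text in the textcat.manual format.
accepted = defaultdict(list)
for eg in read_jsonl("teach_annotations.jsonl"):  # placeholder path
    if eg.get("answer") == "accept":
        accepted[eg["text"]].append(eg["label"])

with open("teach_as_manual.jsonl", "w", encoding="utf8") as f:
    for text, labels in accepted.items():
        eg = {
            "text": text,
            "options": [{"id": label, "text": label} for label in LABELS],
            "accept": labels,  # with exclusive classes this should end up as one label
            "answer": "accept",
        }
        f.write(json.dumps(eg, ensure_ascii=False) + "\n")

Reject answers only tell you that a given label doesn't apply, so this sketch simply drops them.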

Also keep in mind that unless you use a dedicated evaluation set, the results you're seeing aren't necessarily comparable. If you're evaluating binary decisions on a selection of sparse binary annotations, you may see a higher number at the end, but that doesn't mean that your model is "better".