Textcat possible problem with uneven dataset?

Hi everyone, I'm just getting started with Prodigy, having moved from AutoML. Loving Prodigy / spaCy. I was wondering whether the behaviour described below is expected.

We've been training a custom model using the following annotation recipe:

prodigy textcat.manual business_model ./BusinessData.txt --label account,contact_info,online,other,sales,service,tracking --exclusive

Our BusinessData.txt file contains verbatim utterances from our IVR; some examples below:

I would like to link my account
the amount that was paid into the account
I would like to speak to the credit manager
I would like to track the status of my shipment

So far we've been using the manual recipe to assign one exclusive label.

After classifying probably a thousand records, we trained a model:

prodigy train textcat business_model blank:en --output ./models/business_model

We then load this model with spacy.load in a Python shell.
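
Something along these lines (minimal sketch, using the output path from the train command above):

    import spacy

    # load the model trained with prodigy train (path from --output above)
    nlp = spacy.load("./models/business_model")

    doc = nlp("I would like to make a complaint about a delivery driver")
    print(doc.cats)  # scores per label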

What we have observed is that the model is biased towards the more frequently used labels. In some cases it misclassifies even the exact phrase that was used for a categorisation, e.g. this annotation:

{"text":"I would like to make a complaint about a delivery driver","_input_hash":442155054,"_task_hash":-1466638285,"options":[{"id":"account","text":"account"},{"id":"contact_info","text":"contact_info"},{"id":"online","text":"online"},{"id":"other","text":"other"},{"id":"sales","text":"sales"},{"id":"service","text":"service"},{"id":"tracking","text":"tracking"}],"_session_id":null,"_view_id":"choice","accept":["other"],"answer":"accept"}

gets classified incorrectly:

{
"service": 0.07783517241477966,
"online": 0.0007357532740570605,
"account": 0.0011064341524615884,
"tracking": 0.6002848148345947,
"other": 0.017688263207674026,
"sales": 0.07158508896827698,
"contact_info": 0.0007702212897129357
}

Below is the output from our training run:

Label          ROC AUC
service        0.817
online         0.445
account        0.947
tracking       0.914
other          0.788
sales          0.882
contact_info   0.234

Best ROC AUC   0.718
Baseline       0.421

Can I ask, is there something we've done wrong, or does the model just require further training? We've run the train-curve command and it does suggest that more training data would help, but the accuracy only improves by 0.01 in the last run, and in some runs it actually decreases.

If we haven't done anything wrong, what would you suggest at this point? textcat.teach perhaps, and collecting more examples for the labels that are used less frequently?

Also, as a side note, is there a command we can run to see how many examples of each label exist in a dataset? At present we're doing a db-out and working it out from the JSONL file.

Thanks very much!

Prodigy doesn't autodetect whether you have mutually exclusive classes, so prodigy train has an option -TE that should be used when training a model with mutually exclusive classes. The corresponding annotation option (-E) works fine, but when I tested -TE with prodigy train I noticed that there's a bug in the model configuration.

It looks like the command-line option -TE is being ignored, so it's training a multilabel model, which won't perform as well. (The clue in your output is that it reports ROC AUC scores rather than F-scores averaged over all labels, which is what the multilabel model provides.) Until we release a new version with a fix, you can add this in recipes/train.py around line 86, right after pipe_cfg = {}:

    pipe_cfg = {}
    if component == "textcat":
        # pass the exclusive-classes setting through to the textcat component
        pipe_cfg = {
            "exclusive_classes": textcat_exclusive,
        }

I don't think there's a built-in prodigy function to count labels, so the easiest way I can think of is to convert with prodigy data-to-spacy and then analyze with spacy debug-data:

prodigy data-to-spacy -tc dataset spacy-data.json
spacy debug-data en spacy-data.json spacy-data.json -p textcat -V

The verbose output (-V) will show the counts for each category. (debug-data requires both train and dev sets, so just ignore the warnings about overlapping texts.)
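
Alternatively, if you just want a quick count without converting, here's a rough sketch using the database API and the "accept" field from your annotations (swap in your own dataset name):

    from collections import Counter
    from prodigy.components.db import connect

    db = connect()
    examples = db.get_dataset("business_model")  # your dataset name

    # textcat.manual stores the selected label(s) under "accept"
    counts = Counter(
        label
        for eg in examples
        if eg.get("answer") == "accept"
        for label in eg.get("accept", [])
    )
    print(counts)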

Most of what Matt says here is relevant: Best practices & realistic expectations with high number of classes for multiclass text classification task

The main thing that's changed since that thread is that spacy train now supports -p textcat, so you can use prodigy data-to-spacy and then have more options when training with spacy directly. I'd also recommend trying bow instead of simple_cnn for small datasets, which you can set with the option --textcat-arch bow. After training with spacy, if you look at meta.json for model-best in the output directory, you can see the individual P/R/F scores for each of the labels, which might be more useful than the averaged F-score from prodigy train.
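
For example, something along these lines (the output path is just an example, and I'm reusing the same JSON for train and dev purely for illustration; normally you'd hold out a separate evaluation set):

    spacy train en ./models/business_spacy spacy-data.json spacy-data.json -p textcat --textcat-arch bow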

spacy debug-data and spacy train will also try to detect whether you have mutually exclusive labels and spacy train will show warnings if your settings don't seem to match your data.

If you want to use bow in prodigy train and prodigy train-curve, you can add it to the same pipe_cfg above as "architecture": "bow".
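
So the patched section from above would then look like this:

    pipe_cfg = {}
    if component == "textcat":
        pipe_cfg = {
            "exclusive_classes": textcat_exclusive,
            "architecture": "bow",
        }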


Thanks very much. I've modified train.py and added the code you suggested. Below are the commands I've executed:

prodigy textcat.manual -E business1 ./BusinessData.txt --label account,contact_info,online,other,sales,service,tracking
Using 7 label(s): account, contact_info, online, other, sales, service, tracking

After some annotation I then ran the following:

prodigy train -TE textcat business1 blank:en --output ./models/business1

Below is a snippet of the training output:

#   Loss   F-Score
1   9.39   14.505
2   8.88    8.163
3   7.61    8.163
4   7.76    8.163

I don't believe this has made any positive change. To make sure I was editing the right file, I temporarily renamed train.py and ran prodigy train again to confirm that was the one being used, which it was. Would you have any other suggestions?

Many thanks,
Michael