Textcat possible problem with uneven dataset?

adriane · January 13, 2020, 8:35pm

Prodigy doesn't autodetect whether you have mutually exclusive classes, so prodigy train has an option -TE that should be used when training a model with mutually exclusive classes. It looks like this option (as -E) works fine in annotation, but when I tested it with prodigy train I noticed that there's a bug in the model configuration.

It looks like the command-line option -TE is being ignored, so it's training a multilabel model, which will not perform as well. (One clue from your output: the multilabel model reports ROC AUC scores instead of F-scores averaged over all labels.) Until we release a new version with a fix, you can add this in recipes/train.py around line 86 right after pipe_cfg = {}:

    pipe_cfg = {}
    # Workaround: pass the -TE setting through to the textcat component
    if component == "textcat":
        pipe_cfg = {
            "exclusive_classes": textcat_exclusive,
        }

I don't think there's a built-in prodigy function to count labels, so the easiest way I can think of is to convert with prodigy data-to-spacy and then analyze with spacy debug-data:

    prodigy data-to-spacy -tc dataset spacy-data.json
    spacy debug-data en spacy-data.json spacy-data.json -p textcat -V

The verbose output (-V) will show the counts for each category. (debug-data requires both train and dev sets, so just ignore the warnings about overlapping texts.)
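If you'd rather count straight from the annotations, here is a minimal sketch in plain Python. It assumes you've exported the dataset to JSONL first (for example with prodigy db-out) and that the examples use the usual textcat task shapes: either a single "label" with an "answer", or a list of selected options under "accept". The function names count_labels and read_jsonl are just illustrative, not part of Prodigy.

```python
import json
from collections import Counter


def count_labels(examples):
    """Count accepted textcat labels in a list of annotation dicts.

    Assumed shapes (check against your own export):
      binary: {"text": ..., "label": "A", "answer": "accept"}
      choice: {"text": ..., "accept": ["A", "B"], "answer": "accept"}
    """
    counts = Counter()
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored examples
        if "accept" in eg:
            # Choice-style task: the selected option ids are the labels
            counts.update(eg["accept"])
        elif "label" in eg:
            # Binary task: one label that was accepted
            counts[eg["label"]] += 1
    return counts


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling count_labels(read_jsonl("annotations.jsonl")).most_common() then lists the labels from most to least frequent, which makes an uneven distribution easy to spot.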

Most of what Matt says in this thread is relevant: "Best practices & realistic expectations with high number of classes for multiclass text classification task"

The main thing that's changed since that thread is that spacy train now supports -p textcat, so you can use prodigy data-to-spacy and then have more options when training with spacy directly. I'd also recommend trying bow instead of simple_cnn for small datasets, which you can set with the option --textcat-arch bow. After training with spacy, if you look in meta.json for model-best in the output, you can see the individual P/R/F scores for each of the labels, which might be more useful than the averaged F-score from prodigy train.
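To pull the per-label scores out of meta.json without reading it by eye, something like the following sketch should work. It assumes the scores are stored under "accuracy" → "textcats_per_cat" with "p", "r" and "f" entries per label, which is how I'd expect spacy train to write them for textcat in v2.2, but check your own meta.json since the key names are an assumption here. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_meta(model_dir):
    """Read meta.json from a model directory such as output/model-best."""
    return json.loads((Path(model_dir) / "meta.json").read_text(encoding="utf8"))


def per_label_scores(meta):
    """Return (label, p, r, f) tuples sorted by label.

    Assumes the layout meta["accuracy"]["textcats_per_cat"][label] with
    "p", "r" and "f" keys; returns an empty list if that key is missing.
    """
    per_cat = meta.get("accuracy", {}).get("textcats_per_cat", {})
    return [
        (label, scores.get("p"), scores.get("r"), scores.get("f"))
        for label, scores in sorted(per_cat.items())
    ]
```

Sorting the result by the F-score column instead of by label is a quick way to see which of the rarer classes are dragging the average down.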

spacy debug-data and spacy train will also try to detect whether you have mutually exclusive labels and spacy train will show warnings if your settings don't seem to match your data.

If you want to use bow in prodigy train and prodigy train-curve, you can add it to the same pipe_cfg above as "architecture": "bow".
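Putting the two workarounds together, the config would end up looking roughly like this. The sketch wraps it in a small function only so it can be run on its own; build_pipe_cfg is a hypothetical name, and in recipes/train.py you would just add the "architecture" key to the existing pipe_cfg dict in place.

```python
def build_pipe_cfg(component, textcat_exclusive, architecture=None):
    """Build the component config described above (illustrative helper only)."""
    pipe_cfg = {}
    if component == "textcat":
        # Pass the -TE setting through to the text classifier
        pipe_cfg = {"exclusive_classes": textcat_exclusive}
        if architecture is not None:
            # e.g. "bow" instead of the default for small datasets
            pipe_cfg["architecture"] = architecture
    return pipe_cfg
```

For example, build_pipe_cfg("textcat", True, "bow") gives a config with both "exclusive_classes" and "architecture" set, while any other component gets an empty config.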
