Textcat possible problem with uneven dataset?

Prodigy doesn't autodetect whether you have mutually exclusive classes, so prodigy train has an option -TE that should be used when training a model with mutually exclusive classes. It looks like this option (as -E) works fine in annotation, but when I tested it with prodigy train I noticed that there's a bug in the model configuration.

It looks like the command-line option -TE is being ignored, so prodigy train is training a multilabel model, which will not perform as well on mutually exclusive classes. One clue in your output is that it reports ROC AUC scores instead of F-scores averaged over all labels. Until we release a new version with a fix, you can add this in recipes/train.py around line 86, right after pipe_cfg = {}:

    pipe_cfg = {}
    # Workaround: pass the -TE setting through to the textcat component
    if component == "textcat":
        pipe_cfg = {
            "exclusive_classes": textcat_exclusive,
        }

I don't think there's a built-in prodigy function to count labels, so the easiest way I can think of is to convert with prodigy data-to-spacy and then analyze with spacy debug-data:

    prodigy data-to-spacy -tc dataset spacy-data.json
    spacy debug-data en spacy-data.json spacy-data.json -p textcat -V

The verbose output (-V) will show the counts for each category. (debug-data requires both a train and a dev set, which is why the same file is passed twice above, so just ignore the warnings about overlapping texts.)
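If you'd rather count labels directly, here is a minimal sketch in plain Python. It assumes you've exported the dataset to JSONL (for example with prodigy db-out), that each record has an "answer" field, and that the label is stored either under "label" (binary textcat annotations) or as a list under "accept" (choice-style annotations). The helper names count_labels and read_jsonl are made up for this example and are not part of Prodigy.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_labels(examples):
    """Count labels on accepted examples only.

    Assumed record layout (check it against your own export):
    - "answer" is "accept", "reject" or "ignore"
    - "label" holds a single label, or "accept" holds a list of labels
    """
    counts = Counter()
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        if "accept" in eg:
            counts.update(eg["accept"])
        elif "label" in eg:
            counts[eg["label"]] += 1
    return counts


# Small in-memory sample standing in for an exported dataset
sample_examples = [
    {"text": "a", "label": "SPORTS", "answer": "accept"},
    {"text": "b", "label": "SPORTS", "answer": "reject"},
    {"text": "c", "accept": ["POLITICS"], "answer": "accept"},
    {"text": "d", "accept": ["SPORTS", "POLITICS"], "answer": "accept"},
    {"text": "e", "label": "TECH", "answer": "ignore"},
]

for label, n in count_labels(sample_examples).most_common():
    print(label, n)
```

With a real export you would call count_labels(read_jsonl("your-export.jsonl")) instead of using the sample list. Rejected and ignored examples are skipped here, which may or may not be what you want for binary annotations.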

Most of what Matt says in this thread is still relevant: "Best practices & realistic expectations with high number of classes for multiclass text classification task"

The main thing that's changed since that thread is that spacy train now supports -p textcat, so you can convert with prodigy data-to-spacy and then have more options when training with spacy directly. I'd also recommend trying bow instead of simple_cnn for small datasets, which you can set with the option --textcat-arch bow. After training with spacy, if you look in meta.json for model-best in the output directory, you can see the individual P/R/F scores for each of the labels, which might be more useful than the averaged F-score from prodigy train.
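To pull the per-label scores out of meta.json without reading the file by eye, something like the sketch below should work. It assumes the scores sit under "accuracy" and then "textcats_per_cat", with "p", "r" and "f" keys per label, which is how I'd expect a spaCy v2 meta.json to look; check the key names against your own file. The function names and the example path are only for illustration.

```python
import json
from pathlib import Path


def load_meta(path):
    """Read a model's meta.json into a dict."""
    return json.loads(Path(path).read_text(encoding="utf8"))


def per_label_scores(meta):
    """Return (label, p, r, f) tuples sorted by F-score, worst first.

    Assumes meta["accuracy"]["textcats_per_cat"] maps each label to a
    dict with "p", "r" and "f" (assumed key names, verify in your file).
    """
    per_cat = meta.get("accuracy", {}).get("textcats_per_cat", {})
    rows = [
        (label, s.get("p", 0.0), s.get("r", 0.0), s.get("f", 0.0))
        for label, s in per_cat.items()
    ]
    return sorted(rows, key=lambda row: row[3])


# Made-up numbers, only to show the assumed structure
sample_meta = {
    "accuracy": {
        "textcat_score": 64.55,
        "textcats_per_cat": {
            "SPORTS": {"p": 90.0, "r": 80.0, "f": 84.7},
            "POLITICS": {"p": 50.0, "r": 40.0, "f": 44.4},
        },
    }
}

for label, p, r, f in per_label_scores(sample_meta):
    print(f"{label:<10} P={p:.1f} R={r:.1f} F={f:.1f}")
```

For a trained model you would pass load_meta("output/model-best/meta.json") instead of sample_meta, where "output" is whatever output directory you gave spacy train. Sorting worst-first makes the labels that are suffering from the uneven dataset show up at the top.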

spacy debug-data and spacy train will also try to detect whether you have mutually exclusive labels and spacy train will show warnings if your settings don't seem to match your data.

If you want to use bow in prodigy train and prodigy train-curve, you can add it to the same pipe_cfg above as "architecture": "bow".
