textcat vs textcat_multilabel

I'm using Prodigy to annotate my data with a single label, "Sectoral", which I either accept or reject. My labels are mutually exclusive: each entry is either sectoral or not.

I don't understand why I can't use the regular textcat recipe if it's supposed to work with mutually exclusive classes. In theory this is just two mutually exclusive classes: one where I "accept" and another where I "reject". Why do I get this error?

"The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label."

For textcat, the model is designed so that the predictions for all categories always add up to 1.0, so if you have just one category, it will predict 1.0 for that category for every single text.

For textcat_multilabel, each individual category prediction can range from 0.0 to 1.0, independently of the predictions for the other categories for that text.
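
To make the difference concrete, here is a minimal sketch in plain Python of the two output normalizations described above. It assumes that the exclusive case behaves like a softmax over the label scores and the multilabel case like an independent sigmoid per label. The raw scores are made up for illustration, and spaCy's actual model architecture is more involved than this.

```python
import math

def softmax(scores):
    # Exclusive classes: the outputs are normalized so they sum to 1.0.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    # Multilabel: each label is scored on its own, between 0.0 and 1.0.
    return 1.0 / (1.0 + math.exp(-score))

# With a single label, softmax has nothing to normalize against,
# so the result is 1.0 no matter what the raw score is.
one_label_low = softmax([-3.0])
one_label_high = softmax([5.0])

# With two labels (e.g. SECTORAL / NOT_SECTORAL) the probability is shared.
two_labels = softmax([2.0, 0.0])

# With a sigmoid, a single label can be predicted as low or high.
independent = [sigmoid(s) for s in [-3.0, 5.0]]

print(one_label_low, one_label_high)
print(two_labels)
print(independent)
```

This is why a one-label textcat component can't learn anything useful, while a one-label textcat_multilabel component can.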

It's possible the default settings/options in Prodigy could be improved for this particular kind of task, though. What are the commands you're using in your workflow?


The problem with treating accept and reject as two classes is that in practice, you still only have one label (Sectoral) that either applies or doesn't. And there might be use cases where you want to combine this dataset with other binary datasets for other labels and train an exclusive or non-exclusive classifier on all the data.

The alternative would be for Prodigy to add a label NOT_SECTORAL (or OTHER), but that feels like a very invasive default behaviour because it really modifies the data. So if you only have one label you're predicting, the easier solution would be to use the textcat_multilabel component instead.
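
If you do want two mutually exclusive labels for textcat, you could do that conversion yourself as a preprocessing step. The following is only a rough sketch: it assumes each binary annotation is a dict with "text", "label" and "answer" keys, the helper name to_exclusive_cats and the example texts are hypothetical, and NOT_SECTORAL is just the placeholder name mentioned above.

```python
def to_exclusive_cats(example, label="SECTORAL", other="NOT_SECTORAL"):
    # Hypothetical helper: turn one accept/reject answer into two
    # mutually exclusive categories where exactly one is 1.0.
    accepted = example["answer"] == "accept"
    return {
        label: 1.0 if accepted else 0.0,
        other: 0.0 if accepted else 1.0,
    }

# Made-up annotations, for illustration only.
annotations = [
    {"text": "first example text", "label": "SECTORAL", "answer": "accept"},
    {"text": "second example text", "label": "SECTORAL", "answer": "reject"},
    {"text": "third example text", "label": "SECTORAL", "answer": "ignore"},
]

# Skip ignored examples: they carry no information about the label.
converted = [
    (eg["text"], to_exclusive_cats(eg))
    for eg in annotations
    if eg["answer"] in ("accept", "reject")
]

for text, cats in converted:
    print(text, cats)
```

Note that this changes what the data says (a reject becomes a positive example of a second class), which is exactly why it isn't done automatically.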


Hi, I understand. However, it seems counterintuitive to use textcat_multilabel when I only have one label.

I'm first using textcat.manual to annotate with a single label, "sectoral". So what I get as a result is annotations with only one label that I either accept or reject.

Then I use the following:

prodigy train ./path --textcat sectoral_annotations --base-model en_core_web_lg --eval-split 0.5

We're just testing different models right now. But I can't run it because I get this error: "The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label."

It seems counterintuitive that I need to use textcat_multilabel for just one label.

I get that it's an easy solution. However, it still seems counterintuitive that inside the Prodigy environment I have to use textcat_multilabel when I'm only using one label that is either "yes" or "no".

Yeah, I definitely see what you mean! This is how spaCy currently handles it, so we tried to express that 1:1 in Prodigy. There's always a trade-off between handling stuff automatically under the hood and being a bit magical, and trying to be too clever / too magical and complicating things that way. For example, an alternative would be to automatically use the textcat_multilabel component under the hood if there's only one label, but this can potentially lead to unintuitive results. Or we could add a second label automatically, but this introduces the question of how to name it.

I'll definitely keep thinking about this, though – maybe there's a compromise and we can handle this better in Prodigy!
