textcat vs textcat_multilabel

I'm using prodigy to annotate my data using one single label "Sectoral", where I either reject or accept. Now my labels are mutually exclusive, each entry is either sectoral or not.

I don't understand why I cannot use the regular textcat recipe if it's supposed to work with mutually exclusive classes. In theory this is just two mutually exclusive classes: one where I "accept" and another where I "reject". Why do I get this error?

"The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label."


For textcat the model is designed so that the predictions for all categories always add up to 1.0, so if you have just one category, it will predict 1.0 for that category for every text.

For textcat_multilabel each individual category prediction can range from 0.0-1.0 independently from the predictions for the other categories for that text.
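To make the difference concrete, here's a toy sketch of the two scoring schemes in plain Python (not spaCy's actual implementation, just an illustration of why a single softmax class is degenerate):

```python
import math

def softmax(scores):
    # textcat: mutually exclusive classes, predictions sum to 1.0
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_each(scores):
    # textcat_multilabel: each label scored independently in 0.0-1.0
    return [1 / (1 + math.exp(-s)) for s in scores]

# With a single label, softmax can only ever output 1.0,
# no matter how negative the raw score is:
print(softmax([0.3]))    # [1.0]
print(softmax([-5.0]))   # [1.0]

# An independent sigmoid can still express "yes" vs "no":
print(sigmoid_each([0.3]))   # roughly 0.57 -> leaning "yes"
print(sigmoid_each([-5.0]))  # close to 0.0 -> "no"
```

That's why textcat needs at least two labels to say anything, while textcat_multilabel is meaningful with one.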

It's possible the default settings/options in prodigy could be improved for this particular kind of task, though. What are the commands you're using in your workflow?


The problem with this is that in practice, you still only have one label (Sectoral) that either applies or doesn't. And there might be use cases where you want to combine this dataset with other binary datasets for other labels and train an exclusive or non-exclusive classifier on all the data.

The alternative would be for Prodigy to add a label NOT_SECTORAL (or OTHER), but that feels like a very invasive default behaviour because it really modifies the data. So if you only have one label you're predicting, the easier solution would be to use the textcat_multilabel component instead.


Hi, I understand, however it seems counterintuitive to use textcat_multilabel when I only have one label.

I'm first using textcat.manual to annotate over one single label "sectoral". So what I get as a result is annotations with only one label that I either accept or reject.

Then I use the following:

prodigy train ./path --textcat sectoral_annotations --base-model en_core_web_lg --eval-split 0.5

We're just testing different models right now. But I can't run it because I get that error "The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label."

It seems counterintuitive that I need to use textcat_multilabel for just one label.


I get that it's an easy solution. However, it still seems counterintuitive that inside the Prodi.gy environment I have to use textcat_multilabel when I'm only using one label that is either "yes" or "no".

Yeah, I definitely see what you mean! This is how spaCy currently handles it, so we tried to express that 1:1 in Prodigy. There's always this trade-off between handling stuff automatically under the hood and being a bit magical, and trying to be too clever / too magical and complicating things that way :sweat_smile: For example, an alternative would be to automatically use the textcat_multilabel component under the hood if there's only one label, but this can potentially lead to unintuitive results. Or we could add a second label automatically, but this introduces the question of how to name it.

I'll definitely keep thinking about this, though – maybe there's a compromise and we can handle this better in Prodigy!


Hi - How do you specify labels to the train command? The documentation doesn't specify how to pass the 2 labels to the train function. Also I'm guessing the labels have to line up with the ones you used to annotate the dataset.

So far my command is prodigy train ebf/ --textcat ebf_0,ebf_1,ebf_2 --eval-split 0.2 --base-model en_core_web_lg but any attempt to pass a --label parameter has failed

hi @spothedog1,

prodigy train doesn't use a --label argument. It should automatically infer the labels based on the dataset.

Are you able to run this without the --label argument?

prodigy train ebf/ --textcat ebf_0,ebf_1,ebf_2 --eval-split 0.2 --base-model en_core_web_lg

Are you getting any other errors?

No, I'm getting the error

"The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label."

I labeled the data in Prodigy using the command
textcat.teach ebf_0 en_core_web_sm ./data/articles_0.jsonl --label economy_business_finance --patterns ./ebf/patterns.jsonl

When I run db-out it shows that they are labeled {"label": "economy_business_finance", "answer": "accept"} or reject so I'm guessing since there is only 1 label it's failing? Do I need to do some manipulation to the database to turn all reject annotations into NOT_LABEL and all accept annotations into LABEL?

Thanks for the background!

I assume that you can run --textcat-multilabel, right?

python -m prodigy train ebf/ --textcat-multilabel ebf_0,ebf_1,ebf_2 --eval-split 0.2 --base-model en_core_web_lg

I definitely understand the point that this is a bit confusing (why would binary classification use textcat_multilabel?), so maybe such a helper function could do the trick. As the thread above explains, some of this is due to how spaCy handles this, and we try to match that in Prodigy.

A bit of a hack, but another approach would be to modify the original .jsonl / dataset: for every "answer": "reject", change the label to "not_economy_business_finance" and switch the answer to "accept". So in this case, if you want to convert the original labels from a Prodigy dataset (samp-textcat) to a new Prodigy dataset (samp-textcat-new), you can run this:

from prodigy.components.db import connect

# pull examples from dataset
db = connect()
examples = db.get_dataset("samp-textcat")

# convert rejects into accepted "not_" examples
new_examples = []
for eg in examples:
    if eg.get("answer") == "reject":
        eg["label"] = "not_" + str(eg["label"])
        eg["answer"] = "accept"
    new_examples.append(eg)

# create new Prodigy dataset
db.add_dataset("samp-textcat-new", session=True)
db.add_examples(new_examples, datasets=["samp-textcat-new"])

I tried this, and on samp-textcat-new I could run train --textcat, where previously I could only run train --textcat-multilabel. Does this work for you?

If you wanted to avoid this from the start, you need to have two labels in your annotation, which you can just add in a second with a not_ prefix.

textcat.teach ebf_0 en_core_web_sm ./data/articles_0.jsonl --label economy_business_finance,not_economy_business_finance --patterns ./ebf/patterns.jsonl

Thanks again for the question! I think there could be small improvements in the future to avoid this problem.

I tried that out and it worked, thanks!

And yeah, as a beginner, I think it makes sense for the output of textcat.teach --exclusive to feed directly into train --textcat. It just seems logical as a flow for training a binary classifier.

Thanks again for the help! Also, is there anywhere you could point me to understand the output of train? Some documentation on LOSS TOK2VEC, LOSS TEXTCAT, CATS_SCORE, SPEED, SCORE would be helpful, thanks!

[Screenshot: prodigy train output table, 2022-07-15]

Completely agree! I found the same issue while I was trying to create new content for beginners and thought the same. We'll see about improving this in the future.

Good point! Check out this recent post:

I appreciate your point on more documentation! I've made a note on this and will look into creating some content on this topic in the future.

Thanks again for your feedback and reach back out any time we can help!


Hi - I'm just chiming in to say please make it more apparent (or offer a CLI way to convert accept/reject to labels) with exclusive binary labels. I hit this every so often, have to remember why, and then write code to fix my Prodigy labels, which is annoying.
Lynn

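For what it's worth, until there's a built-in option, a small standalone script over db-out output can do this conversion without touching the database directly. A sketch (the not_ prefix, the script name, and the file names are my own choices, not a Prodigy feature):

```python
# flip_rejects.py - read a db-out JSONL file, rewrite rejected
# examples as accepted "not_<label>" examples, and print the
# converted JSONL for db-in.
import json
import sys

def flip(line):
    eg = json.loads(line)
    if eg.get("answer") == "reject":
        eg["label"] = "not_" + str(eg["label"])
        eg["answer"] = "accept"
    return json.dumps(eg)

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f_in:
        for line in f_in:
            line = line.strip()
            if line:
                print(flip(line))
```

Used roughly like: `prodigy db-out ebf_0 > ebf_0.jsonl`, then `python flip_rejects.py ebf_0.jsonl > ebf_0_binary.jsonl`, then `prodigy db-in ebf_0_binary ebf_0_binary.jsonl`, after which train --textcat should accept the new dataset.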