In Prodigy, I used the data-to-spacy recipe to export the dataset to a .json file, using -TE. Then in spaCy v3, I used the convert command to convert the JSON to .spacy format. There was no error during conversion. But when I train a textcat pipeline, it gives CATS_SCORE=100 right away and keeps it the whole time, as if it thinks all the data are labelled the same way. (exclusive_classes is set to true in config.cfg)
In the .json file, I can see that the format is like this:
"cats":[{"label":"MyLabel","value":0.0}]
The value is either 0.0 or 1.0.
Does cats need the opposite label as well, like this?
"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}
Hi! How many labels does your data have in total? Do you only have the one label, MyLabel? If your goal is to predict a single exclusive label, then you're right and you would need a second label that the model should predict instead if your main label doesn't apply.
This is definitely something we need to fix going forward in the upcoming version of Prodigy if text classification data with only one label is provided and --textcat-exclusive is set. (It's maybe a bit unideal, but in that case, it should probably just add a label like OTHER automatically and set that to 1.0 if the other label doesn't apply.)
As a workaround, one option could be to just write a quick script that adds a second label to all the cats, which should be pretty easy to do programmatically.
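For illustration, here is a rough sketch of what such a script could look like. It assumes each cats entry is a list of {"label": ..., "value": ...} dicts as in the snippet from the question, and it walks the JSON recursively so it doesn't depend on the exact nesting of the exported file. The function names and the OTHER label are placeholders I made up, not anything Prodigy or spaCy provides.

```python
import json


def add_other_label(cats, main_label="MyLabel", other_label="OTHER"):
    """Return a copy of `cats` with a complementary label appended.

    `cats` is assumed to be a list of {"label": str, "value": float} dicts.
    The new label gets 1.0 when the main label is 0.0, and vice versa.
    """
    result = [dict(cat) for cat in cats]
    labels = {cat["label"] for cat in result}
    main = next((cat for cat in result if cat["label"] == main_label), None)
    # Only add the complement if the main label is present and OTHER is not.
    if main is not None and other_label not in labels:
        result.append({"label": other_label, "value": 1.0 - float(main["value"])})
    return result


def patch_cats(node, main_label="MyLabel", other_label="OTHER"):
    """Recursively patch every "cats" list found in a loaded JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "cats" and isinstance(value, list):
                node[key] = add_other_label(value, main_label, other_label)
            else:
                patch_cats(value, main_label, other_label)
    elif isinstance(node, list):
        for item in node:
            patch_cats(item, main_label, other_label)
    return node


# Tiny made-up example; the surrounding keys are only for illustration.
example = {"paragraphs": [{"raw": "some text", "cats": [{"label": "MyLabel", "value": 0.0}]}]}
print(json.dumps(patch_cats(example), indent=2))
```

To use it on a real export, you would load the file with json.load, call patch_cats on the result, write it back out with json.dump, and then run the convert command on the patched file.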
Thanks for the reply. I will give it a try.
Btw, to add to my comment above, if your data only has one label, you can use the textcat_multilabel component instead of the regular textcat component: https://spacy.io/api/textcategorizer
I'm not sure I understand. Did I get this right?
- textcat requires two labels, one for yes and one for no
- Prodigy's output only has the yes label and misses the no label
- one workaround to this is adding the No label to the output and using textcat
- another workaround is to use textcat_multilabel without changing the Prodigy output
And a follow-up question: Does using textcat_multilabel as a workaround have any other implications on the text classification architecture and model performance?
Yes, that's correct. To make the first point more explicit: textcat requires at least two labels, so if your task is binary, that would have to be one for the binary label and one for everything else. But of course, it can also have more labels. The latest spaCy v3.1 will now also raise an explicit error if you initialize a textcat component with only one label.
The textcat_multilabel component is a variation of the textcat component. It uses the same architectures by default and the config only really differs in the exclusive_classes setting. The main difference is in the initialization and scoring.
The choice of component will have an impact on the results, but that's mostly due to the exclusive_classes setting and how the labels are interpreted.
Just released v1.11, which now supports separate arguments for --textcat (mutually exclusive categories) and --textcat-multilabel (single label or multiple non-exclusive labels): https://prodi.gy/docs/recipes#train