Exporting a dataset from Prodigy and training textcat in spaCy v3

In Prodigy, I used the data-to-spacy recipe to export the dataset to a .json file, using -TE. Then in spaCy v3, I used the convert command to convert the JSON to the .spacy format. There were no errors during conversion. But when I train a textcat pipeline, it gives CATS_SCORE=100 right away and stays there, as if it thinks all the data is labelled the same way. (exclusive_classes is set to true in config.cfg.)
In the .json file, I can see that the format is like this:
The value is either 0.0 or 1.0.
In cats, does it need the opposite label as well? Like:
"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}

Hi! How many labels does your data have in total? Do you only have the one label, MyLabel? If your goal is to predict a single exclusive label, then you're right and you would need a second label that the model should predict instead if your main label doesn't apply.

This is definitely something we need to fix going forward in the upcoming version of Prodigy if text classification data with only one label is provided and --textcat-exclusive is set. (It's maybe a bit unideal, but in that case, it should probably just add a label like OTHER automatically and set that to 1.0 if the other label doesn't apply.)

As a workaround, one option could be to just write a quick script that adds a second label to all the cats, which should be pretty easy to do programmatically.
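A minimal sketch of such a script is below. It assumes each exported example carries a `cats` dict mapping label names to 0.0 or 1.0, as in the question; the actual layout of your exported file may differ, so the loading and saving steps are left out. The helper name `add_fallback_label` and the example records are hypothetical, and `OTHER` is just the fallback name suggested above.

```python
from typing import Dict


def add_fallback_label(cats: Dict[str, float], fallback: str = "OTHER") -> Dict[str, float]:
    """Return a copy of `cats` with a mutually exclusive fallback label added."""
    updated = dict(cats)
    # The fallback applies only when none of the real labels is set to 1.0.
    has_positive = any(
        value >= 0.5 for label, value in cats.items() if label != fallback
    )
    updated[fallback] = 0.0 if has_positive else 1.0
    return updated


# Hypothetical examples standing in for the exported data.
examples = [
    {"text": "first example", "cats": {"MyLabel": 1.0}},
    {"text": "second example", "cats": {"MyLabel": 0.0}},
]

# Build new records so the original examples stay untouched.
fixed = [{**eg, "cats": add_fallback_label(eg["cats"])} for eg in examples]
```

After this step, every example has exactly one label set to 1.0, which is what a component with exclusive classes expects.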

Thanks for the reply. I will give it a try.

Btw, to add to my comment above, if your data only has one label, you can use the textcat_multilabel component instead of the regular textcat component: https://spacy.io/api/textcategorizer

I'm not sure I understand. Did I get this right?

  • textcat requires two labels, one for yes and one for no
  • Prodigy's output only has the yes label and misses the no label
  • one workaround to this is adding the No label to the output and using textcat
  • another workaround is to use textcat_multilabel without changing the Prodigy output

And a follow up question: Does using textcat_multilabel as a workaround have any other implications on the text classification architecture and model performance?

Yes, that's correct. To make the first point more explicit: textcat requires at least two labels, so if your task is binary, that would have to be one for the binary label and one for everything else. But of course, it can also have more labels. The latest spaCy v3.1 will now also raise an explicit error if you initialize a textcat component with only one label.

The textcat_multilabel component is a variation of the textcat component. It uses the same architectures by default and the config only really differs in the exclusive_classes setting. The main difference is in the initialization and scoring:

The choice of component will have an impact on the results, but that's mostly due to the exclusive_classes setting and how the labels are interpreted.
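To illustrate how the interpretation of labels differs, here is a rough sketch in plain Python. It is not spaCy's implementation; it assumes that exclusive classes are normalized so the scores sum to 1 (a softmax) and that non-exclusive labels are each scored independently (a sigmoid). Under that assumption, a single exclusive label always gets a score of 1.0 regardless of the input, which would be consistent with the constant CATS_SCORE=100 reported above.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize logits so the scores sum to 1 (mutually exclusive labels)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit: float) -> float:
    """Score one label independently of all others (non-exclusive labels)."""
    return 1.0 / (1.0 + math.exp(-logit))


# A model with one label that strongly "disagrees" with the input.
single_label_logits = [-3.0]

# Exclusive: with only one label, normalization forces the score to 1.0.
exclusive_scores = softmax(single_label_logits)

# Non-exclusive: the same logit can still yield a low score.
independent_scores = [sigmoid(x) for x in single_label_logits]
```

With two or more exclusive labels, the normalization becomes meaningful again, because the labels compete for the total score of 1.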

Just released v1.11, which now supports separate arguments for --textcat (mutually exclusive categories) and --textcat-multilabel (single label or multiple non-exclusive labels): https://prodi.gy/docs/recipes#train