Exporting dataset from prodigy and train textcat in spaCy v3

teresa · July 2, 2021, 11:31pm

In Prodigy, I used the data-to-spacy recipe to export the dataset to a .json file, using -TE. Then in spaCy v3, I used the convert command to convert the JSON to the .spacy format. There were no errors during conversion. But when I train a textcat pipeline, it reports CATS_SCORE=100 right away and at every step, as if it thinks all the data is labelled the same way. (exclusive_classes is set to true in config.cfg.)
In the .json file, I can see that the format is like this:
"cats":[{"label":"MyLabel","value":0.0}]
The value is either 0.0 or 1.0.
Does cats need the opposite label as well? Like this:
"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}

ines · July 5, 2021, 12:38am

Hi! How many labels does your data have in total? Do you only have the one label, MyLabel? If your goal is to predict a single exclusive label, then you're right and you would need a second label that the model should predict instead if your main label doesn't apply.

This is definitely something we need to fix going forward in the upcoming version of Prodigy if text classification data with only one label is provided and --textcat-exclusive is set. (It's maybe a bit unideal, but in that case, it should probably just add a label like OTHER automatically and set that to 1.0 if the other label doesn't apply.)

As a workaround, one option could be to just write a quick script that adds a second label to all the cats, which should be pretty easy to do programmatically.
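A minimal sketch of what such a script could look like, assuming each example carries exactly one label in either the list form from the question ("cats":[{"label":...,"value":...}]) or the dict form ({"POSITIVE": 1.0}). The function name add_other_label is made up for illustration, and OTHER is just the label name suggested above. How you walk the exported JSON to reach each cats entry depends on your file, so that part is left out:

```python
def add_other_label(cats, other_label="OTHER"):
    """Return a copy of `cats` with a complementary label added.

    Hypothetical helper: assumes a single existing label whose value
    is 0.0 or 1.0. Anything else is returned unchanged (as a copy).
    """
    if isinstance(cats, dict):
        # Dict form, e.g. {"MyLabel": 1.0}
        if len(cats) != 1 or other_label in cats:
            return dict(cats)
        (value,) = cats.values()
        return {**cats, other_label: 1.0 - float(value)}

    # List form, e.g. [{"label": "MyLabel", "value": 0.0}]
    labels = [cat["label"] for cat in cats]
    if len(cats) != 1 or other_label in labels:
        return list(cats)
    complement = {"label": other_label, "value": 1.0 - float(cats[0]["value"])}
    return list(cats) + [complement]


example = add_other_label([{"label": "MyLabel", "value": 0.0}])
```

After this, every example has two mutually exclusive labels, and exactly one of them is set to 1.0.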

teresa · July 6, 2021, 12:34pm

Thanks for the reply. I will give it a try.

ines · July 7, 2021, 1:17am

Btw, to add to my comment above, if your data only has one label, you can use the textcat_multilabel component instead of the regular textcat component: https://spacy.io/api/textcategorizer

psimm · July 7, 2021, 3:16pm

I'm not sure I understand. Did I get this right?

1. textcat requires two labels, one for yes and one for no
2. Prodigy's output only has the yes label and is missing the no label
3. one workaround is to add the no label to the output and use textcat
4. another workaround is to use textcat_multilabel without changing the Prodigy output

And a follow-up question: Does using textcat_multilabel as a workaround have any other implications for the text classification architecture and model performance?

ines · July 8, 2021, 4:11am

Yes, that's correct. To make the first point more explicit: textcat requires at least two labels, so if your task is binary, you need one label for the binary category and one for everything else. But of course, it can also have more labels. The latest spaCy v3.1 will now also raise an error explicitly if you initialize a textcat component with only one label.

The textcat_multilabel component is a variation of the textcat component. It uses the same architectures by default and the config only really differs in the exclusive_classes setting. The main difference is in the initialization and scoring:

https://github.com/explosion/spaCy/blob/master/spacy/pipeline/textcat_multilabel.py

from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional

from thinc.api import Config, Model
from thinc.types import Floats2d

from ..errors import Errors
from ..language import Language
from ..scorer import Scorer
from ..tokens import Doc
from ..training import Example, validate_get_examples
from ..util import registry
from ..vocab import Vocab
from .textcat import TextCategorizer

multi_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"

[model.tok2vec]

(This file preview has been truncated.)

The choice of component will have an impact on the results, but that's mostly due to the exclusive_classes setting and how the labels are interpreted.
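To illustrate the difference in interpretation, here is a rough sketch in plain Python. It is not spaCy's code, just the general idea: with exclusive classes the scores are normalized across labels so they sum to 1 (softmax), while with non-exclusive classes each label gets its own independent score between 0 and 1 (sigmoid). The raw score -3.2 is an arbitrary made-up value:

```python
import math


def softmax(scores):
    """Normalize raw scores so they sum to 1 (exclusive classes)."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def sigmoid(score):
    """Squash one raw score into (0, 1), independent of other labels."""
    return 1.0 / (1.0 + math.exp(-score))


# With a single exclusive label there is nothing to normalize against,
# so the label's probability is 1.0 no matter what the raw score is.
single_exclusive = softmax([-3.2])

# Scored independently, the same raw score can say "no" for that label.
single_independent = sigmoid(-3.2)
```

This is one plausible reading of why a single exclusive label looks perfectly predicted from the first step: the only label available always wins.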

ines · August 12, 2021, 12:10pm

Just released Prodigy v1.11, which now supports separate arguments for --textcat (mutually exclusive categories) and --textcat-multilabel (single label or multiple non-exclusive labels): https://prodi.gy/docs/recipes#train
