training data format for multiclass textcat

ryanwesslen · August 26, 2022, 1:50pm

Thanks for your question and welcome to the Prodigy community!

Are you interested in only training in spaCy or Prodigy? It seems you're asking for both so I'll try to provide both answers.

First for spaCy: If you're only interested in spaCy, here's an example of a standard format (see spaCy tests):

TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]
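If your raw annotations are just lists of the labels that apply to each text, a small helper can expand them into the "cats" dictionaries shown above. This is only a minimal sketch: `LABELS`, `make_cats` and `to_train_example` are hypothetical names for illustration, not part of spaCy.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed label set, taken from the example above.
LABELS: List[str] = ["ANGRY", "CONFUSED", "HAPPY"]


def make_cats(active: Iterable[str], labels: List[str] = LABELS) -> Dict[str, float]:
    """Give every known label an explicit score: 1.0 if it applies, else 0.0."""
    active_set = set(active)
    unknown = active_set - set(labels)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return {label: 1.0 if label in active_set else 0.0 for label in labels}


def to_train_example(text: str, active: Iterable[str]) -> Tuple[str, dict]:
    """Build one (text, annotations) tuple in the format shown above."""
    return (text, {"cats": make_cats(active)})


example = to_train_example("I'm angry and confused", ["ANGRY", "CONFUSED"])
```

Listing every label explicitly, including the ones scored 0.0, keeps each example unambiguous about which labels do not apply.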

The spaCy documentation has more details on setting up training.

The rest of my response will assume the question is for Prodigy as this forum is for Prodigy.

If you're interested in the format for training in Prodigy, we have several examples in Prodigy Support that can help:

And if you're doing binary classification, sometimes it can be confusing because some examples show textcat_multilabel. Here's a post where we try to convert binary data so that you can use textcat training instead:

You can do this but it's optional if you're training in Prodigy. An alternative route is to get the data into a .jsonl format, then load it as a Prodigy dataset using the db-in command. Then you can use prodigy train by pointing to the dataset.
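As a rough sketch of that route, the spaCy-style tuples can be turned into JSONL records. The record layout below (a "text", a list of "options", the selected "accept" ids and an "answer") is my understanding of what Prodigy's multiple-choice textcat annotations look like, so please check it against the examples linked above. The helper names and the 0.5 threshold are assumptions for illustration.

```python
import json

TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]


def to_prodigy_task(text, annotations):
    """Convert one (text, {"cats": ...}) pair into a choice-style record."""
    cats = annotations["cats"]
    return {
        "text": text,
        # Every label becomes an option; the accepted ones are listed by id.
        "options": [{"id": label, "text": label} for label in cats],
        "accept": [label for label, score in cats.items() if score >= 0.5],
        "answer": "accept",
    }


def to_jsonl(train_data):
    """Serialize all examples as newline-delimited JSON, one record per line."""
    return "".join(json.dumps(to_prodigy_task(t, a)) + "\n" for t, a in train_data)


def save_jsonl(path, train_data):
    """Write the JSONL text to disk so it can be imported with db-in."""
    with open(path, "w", encoding="utf8") as f:
        f.write(to_jsonl(train_data))


jsonl_text = to_jsonl(TRAIN_DATA_MULTI_LABEL)
```

After calling `save_jsonl("textcat_data.jsonl", TRAIN_DATA_MULTI_LABEL)`, the file could be imported with something like `prodigy db-in my_textcat_data textcat_data.jsonl`, where both the file name and the dataset name are placeholders.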

One key point to be careful about: spaCy / Prodigy use slightly different terminology for text classification (quoted below from the spaCy textcat documentation):

The text categorizer predicts categories over a whole document and comes in two flavors: textcat and textcat_multilabel. When you need to predict exactly one true label per document, use the textcat which has mutually exclusive labels. If you want to perform multi-label classification and predict zero, one or more true labels per document, use the textcat_multilabel component instead. For a binary classification task, you can use textcat with two labels or textcat_multilabel with one label.

Notice that the term "multiclass" doesn't appear. The key difference is whether you want your labels to be mutually exclusive (in which case you'd use textcat) or non-mutually exclusive (use textcat_multilabel). This will be important because, even after you format and load your data, you will need to select the appropriate type of model as an argument to your prodigy train command.
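If you're unsure which of the two your data calls for, one quick check is whether every example has exactly one positive label. The function below is a hypothetical helper sketching that heuristic, not something spaCy or Prodigy provides, and it assumes scores of 0.5 or higher count as positive.

```python
def choose_component(train_data, threshold=0.5):
    """Suggest "textcat" if every example has exactly one positive label,
    otherwise "textcat_multilabel"."""
    for _text, annotations in train_data:
        positives = sum(1 for score in annotations["cats"].values() if score >= threshold)
        if positives != 1:
            return "textcat_multilabel"
    return "textcat"


multi_label_data = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]
exclusive_data = [
    ("I'm furious", {"cats": {"ANGRY": 1.0, "CONFUSED": 0.0, "HAPPY": 0.0}}),
    ("What a great day", {"cats": {"ANGRY": 0.0, "CONFUSED": 0.0, "HAPPY": 1.0}}),
]

suggestion_multi = choose_component(multi_label_data)
suggestion_exclusive = choose_component(exclusive_data)
```

This only describes the data you happen to have. If your labels are conceptually non-exclusive but no example carries two of them yet, textcat_multilabel is still the right choice. In recent Prodigy versions, as far as I know, the choice is made by passing the dataset to either `--textcat` or `--textcat-multilabel` in prodigy train.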

Last, I highly recommend looking at some of the spaCy project templates. There are several for textcat like:

FYI these typically cover more of spaCy than Prodigy -- however, a few do include the process of loading .jsonl into Prodigy. Although it's for ner, there's also a helpful template on Prodigy-spaCy project integration:

Thanks again for your question! I can understand it's sometimes tough to navigate through all of the resources so I wouldn't be surprised if others have the same question. Let me know if you have any follow up questions!
