training data format for multiclass textcat

Hi @n8te!

Thanks for your question and welcome to the Prodigy community :wave:

Are you interested in training only in spaCy, or in Prodigy? It seems you're asking about both, so I'll try to answer for both.

First, for spaCy: if that's all you need, here's an example of the standard format (see spaCy tests):

TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]

Here are more details from spaCy on setting up training.

The rest of my response will assume the question is for Prodigy as this forum is for Prodigy.

If you're interested in the format for training in Prodigy, we have several examples in Prodigy Support that can help:

And if you're doing binary classification, sometimes it can be confusing because some examples show textcat_multilabel. Here's a post where we try to convert binary data so that you can use textcat training instead:

You can do this but it's optional if you're training in Prodigy. An alternative route is to get the data into a .jsonl format, then load it as a Prodigy dataset using the db-in command. Then you can use prodigy train by pointing to the dataset.
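As a rough illustration of the .jsonl route, here's a minimal sketch that turns the spaCy-style tuples above into one JSON object per line. The record layout ("options" / "accept" / "answer", as produced by a choice-style annotation interface) and the 0.5 threshold are my assumptions, and the helper names and file name are made up for the example, so please check the layout against the docs for your Prodigy version before importing.

```python
import json
from pathlib import Path

# Same (text, {"cats": {...}}) tuples as in the spaCy example above.
TRAIN_DATA = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]


def to_prodigy_record(text, annotations, threshold=0.5):
    """Build one choice-style task dict (assumed layout, hypothetical helper)."""
    cats = annotations["cats"]
    return {
        "text": text,
        # Every known label is offered as an option ...
        "options": [{"id": label, "text": label} for label in cats],
        # ... and labels scored at or above the threshold count as selected.
        "accept": [label for label, score in cats.items() if score >= threshold],
        "answer": "accept",
    }


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    records = [to_prodigy_record(text, ann) for text, ann in TRAIN_DATA]
    write_jsonl("textcat_data.jsonl", records)  # hypothetical file name
```

From there, something like prodigy db-in my_textcat_data textcat_data.jsonl would load the file, and prodigy train ./output --textcat-multilabel my_textcat_data would train from it (the dataset name and output path are placeholders).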

One key point to be careful about: spaCy and Prodigy use slightly different terminology for text classification than you may be used to (quoted below from the spaCy textcat documentation):

The text categorizer predicts categories over a whole document and comes in two flavors: textcat and textcat_multilabel. When you need to predict exactly one true label per document, use the textcat which has mutually exclusive labels. If you want to perform multi-label classification and predict zero, one or more true labels per document, use the textcat_multilabel component instead. For a binary classification task, you can use textcat with two labels or textcat_multilabel with one label.

Notice that the term "multiclass" doesn't appear. The key difference is whether your labels are mutually exclusive (in which case you'd use textcat) or non-mutually exclusive (use textcat_multilabel). This matters because even after you format and load your data, you'll need to select the matching component type as an argument to your prodigy train command.
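If you're unsure which of the two your data calls for, a quick check along these lines can help. It's only a sketch: suggest_component is a hypothetical helper (not part of spaCy or Prodigy), and treating a score of 0.5 or more as a "true" label is my assumption.

```python
def suggest_component(examples, threshold=0.5):
    """Suggest "textcat" only if every example has exactly one true label.

    Hypothetical helper: `examples` is a list of (text, {"cats": {...}}) tuples.
    """
    for _text, annotations in examples:
        positives = [
            label
            for label, score in annotations["cats"].items()
            if score >= threshold
        ]
        # Zero or several true labels means the labels are not mutually exclusive.
        if len(positives) != 1:
            return "textcat_multilabel"
    return "textcat"


MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]
EXCLUSIVE = [
    ("I'm furious", {"cats": {"ANGRY": 1.0, "CONFUSED": 0.0, "HAPPY": 0.0}}),
    ("What a great day", {"cats": {"ANGRY": 0.0, "CONFUSED": 0.0, "HAPPY": 1.0}}),
]

if __name__ == "__main__":
    print(suggest_component(MULTI_LABEL))
    print(suggest_component(EXCLUSIVE))
```

The result maps onto the prodigy train flags: --textcat for mutually exclusive labels, --textcat-multilabel otherwise.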

Last, I highly recommend looking at some of the spaCy project templates. There are several for textcat like:

FYI these typically cover more of spaCy than Prodigy -- however, a few do include the process of loading .jsonl into Prodigy. Although it's for ner, there's also a helpful template on Prodigy-spaCy project integration:

Thanks again for your question! I understand it can be tough to navigate all of the resources, so I wouldn't be surprised if others have the same question. Let me know if you have any follow-up questions!
