Hi @n8te!
Thanks for your question and welcome to the Prodigy community!
Are you interested in training only in spaCy, or in Prodigy? It seems you're asking about both, so I'll try to answer for both.
First for spaCy: If you're only interested in spaCy, here's an example of a standard format (see spaCy tests):
TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]
The spaCy documentation has more details on setting up training.
The rest of my response will assume the question is for Prodigy as this forum is for Prodigy.
If you're interested in the data format for training in Prodigy, we have several examples in Prodigy Support that can help:
And if you're doing binary classification, it can sometimes be confusing because some examples show textcat_multilabel. Here's a post where we try to convert binary data so that you can use textcat training instead:
You can do this, but it's optional if you're training in Prodigy. An alternative route is to get the data into a .jsonl format, then load it as a Prodigy dataset using the db-in command. Then you can use prodigy train by pointing to the dataset.
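To make the .jsonl route a bit more concrete, here's a minimal sketch in Python that converts data shaped like the spaCy example above into one JSON object per line. It assumes Prodigy's choice-style task format (an "options" list plus an "accept" list of selected label IDs), which is what I'd expect for non-exclusive labels. The 0.5 cutoff, the helper names and the file name textcat_data.jsonl are placeholders I made up for illustration, so please check the output against the examples in the Prodigy docs before importing.

```python
import json

# Same shape as the spaCy-style example earlier in this post.
TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]


def to_prodigy_tasks(train_data, threshold=0.5):
    """Convert (text, {"cats": {...}}) tuples into choice-style task dicts.

    Assumption: labels scored at or above `threshold` count as selected.
    """
    tasks = []
    for text, annotations in train_data:
        cats = annotations["cats"]
        tasks.append({
            "text": text,
            # Every label is offered as an option...
            "options": [{"id": label, "text": label} for label in cats],
            # ...and only the positive ones are listed as accepted.
            "accept": [label for label, score in cats.items() if score >= threshold],
            "answer": "accept",
        })
    return tasks


def write_jsonl(tasks, path):
    """Write one JSON object per line (the .jsonl layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")


tasks = to_prodigy_tasks(TRAIN_DATA_MULTI_LABEL)

if __name__ == "__main__":
    # Hypothetical output file name; use whatever path suits your project.
    write_jsonl(tasks, "textcat_data.jsonl")
```

With the file written, something like prodigy db-in my_textcat_data textcat_data.jsonl followed by prodigy train ./output --textcat-multilabel my_textcat_data should work in recent Prodigy versions (my_textcat_data and ./output are made-up names here).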
One key point to be careful about: spaCy and Prodigy use slightly different terminology for text classification (quoted below from the spaCy textcat documentation):
The text categorizer predicts categories over a whole document and comes in two flavors: textcat and textcat_multilabel. When you need to predict exactly one true label per document, use the textcat which has mutually exclusive labels. If you want to perform multi-label classification and predict zero, one or more true labels per document, use the textcat_multilabel component instead. For a binary classification task, you can use textcat with two labels or textcat_multilabel with one label.
Notice that the term "multiclass" doesn't appear. The key difference is whether your labels are mutually exclusive (in which case you'd use textcat) or not mutually exclusive (use textcat_multilabel). This matters because even after you format and load your data, you still need to select the matching component type as an argument to your prodigy train command.
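If you're not sure which of the two your data calls for, a quick check like the one below can help. This is just a rough heuristic I'm sketching here, not something spaCy or Prodigy provides: the function name suggest_component and the 0.5 cutoff are made up for illustration. It suggests textcat only when every example has exactly one positive label.

```python
def suggest_component(train_data, threshold=0.5):
    """Suggest a component name based on how many labels are positive per example.

    Heuristic only: if any example has zero or several positive labels,
    the labels are not mutually exclusive in this data.
    """
    for _text, annotations in train_data:
        positives = [
            label
            for label, score in annotations["cats"].items()
            if score >= threshold
        ]
        if len(positives) != 1:
            return "textcat_multilabel"
    return "textcat"


# Exactly one positive label per example: mutually exclusive.
exclusive_data = [
    ("I'm furious", {"cats": {"ANGRY": 1.0, "HAPPY": 0.0}}),
    ("What a great day", {"cats": {"ANGRY": 0.0, "HAPPY": 1.0}}),
]

# Some examples carry more than one positive label.
multi_label_data = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]

print(suggest_component(exclusive_data))
print(suggest_component(multi_label_data))
```

Keep in mind this only looks at the examples you already have. If you know a document could have several labels (or none) even though your current sample doesn't show it, textcat_multilabel is still the better fit.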
Last, I highly recommend looking at some of the spaCy project templates. There are several for textcat, like:
FYI, these typically cover more of spaCy than Prodigy. However, a few do include the process of loading .jsonl into Prodigy. Although it's for ner, there's also a helpful template on Prodigy-spaCy project integration:
Thanks again for your question! I understand it can be tough to navigate all of the resources, so I wouldn't be surprised if others have the same question. Let me know if you have any follow-up questions!