training data format for multiclass textcat

Hello,

I am very new to spaCy and I am trying to learn how to train a multiclass categoriser with 17 labels so that each text is assigned only one label. I have the data in CSV and have also one-hot encoded it, but I am struggling to see what format spaCy will accept it in. I think it needs to be in JSON format that is then converted to the .spacy format, but I can't see any simple examples or tutorials showing how this is done. I also have Prodigy from a while back, when I was doing NER work. I tried Prodigy as I think it can accept text and cats in a simple text file, but again, I can't seem to find a simple example of how this is achieved. I see many examples online where there are data with text and label in a simple format, but there seems to be a lack of information on the process of getting this into spaCy. Any help is hugely appreciated!

Hi @n8te!

Thanks for your question and welcome to the Prodigy community :wave:

Are you interested in only training in spaCy or Prodigy? It seems you're asking for both so I'll try to provide both answers.

First, for spaCy: here's an example of a standard format (see the spaCy tests):

TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]

The spaCy documentation has more details on setting up training.
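
For the single-label case in the original question, exactly one category per text would be 1.0. Here is a minimal sketch of building such "cats" dictionaries and serializing them to a binary .spacy file. The labels, example texts, helper names and the "train.spacy" path are made up for illustration, and the sketch assumes English text and spaCy v3.

```python
def make_exclusive_cats(label, all_labels):
    """Return a cats dict with 1.0 for `label` and 0.0 for every other label."""
    if label not in all_labels:
        raise ValueError(f"Unknown label: {label!r}")
    return {name: 1.0 if name == label else 0.0 for name in all_labels}


def write_docbin(examples, all_labels, output_path):
    """Serialize (text, label) pairs to a binary .spacy file."""
    # Imported lazily so make_exclusive_cats can be used without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # assumption: English text; only the tokenizer is used
    doc_bin = DocBin()
    for text, label in examples:
        doc = nlp.make_doc(text)
        doc.cats = make_exclusive_cats(label, all_labels)
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)


# Hypothetical labels and examples, for illustration only.
LABELS = ["ANGRY", "CONFUSED", "HAPPY"]
EXAMPLES = [("I'm so angry", "ANGRY"), ("I'm happy today", "HAPPY")]

cats = make_exclusive_cats("HAPPY", LABELS)
print(cats)
# With spaCy installed, this would write the training file:
# write_docbin(EXAMPLES, LABELS, "train.spacy")
```

Listing every label in each dictionary, with 0.0 for the negatives, keeps the label set identical across all examples.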

The rest of my response will assume the question is for Prodigy as this forum is for Prodigy.

If you're interested in the format for training in Prodigy, we have several examples in Prodigy Support that can help:

And if you're doing binary classification, sometimes it can be confusing because some examples show textcat_multilabel. Here's a post where we try to convert binary data so that you can use textcat training instead:

You can do this but it's optional if you're training in Prodigy. An alternative route is to get the data into a .jsonl format, then load it as a Prodigy dataset using the db-in command. Then you can use prodigy train by pointing to the dataset.

One key point to be careful about: spaCy / Prodigy use slightly different terminology for text classification (below from the spaCy textcat documentation):

The text categorizer predicts categories over a whole document and comes in two flavors: textcat and textcat_multilabel. When you need to predict exactly one true label per document, use the textcat which has mutually exclusive labels. If you want to perform multi-label classification and predict zero, one or more true labels per document, use the textcat_multilabel component instead. For a binary classification task, you can use textcat with two labels or textcat_multilabel with one label.

Notice that the term "multiclass" doesn't appear. The key difference is whether you want your labels to be mutually exclusive (in which case you'd use textcat) or non-mutually exclusive (use textcat_multilabel). This matters because even after you format and load your data, you will need to select the appropriate type of model as an argument to your prodigy train command.
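
To make the distinction concrete, here is a small sketch that inspects annotations and suggests which component matches them. It is only a heuristic on the data, not part of spaCy or Prodigy; the function names and the sample annotations are hypothetical.

```python
def is_mutually_exclusive(cats):
    """True if exactly one category is marked positive (1.0)."""
    return sum(1 for value in cats.values() if value == 1.0) == 1


def suggest_component(annotations):
    """Suggest "textcat" if every example has exactly one positive label."""
    if all(is_mutually_exclusive(cats) for cats in annotations):
        return "textcat"
    return "textcat_multilabel"


# Hypothetical annotations: one label per text vs. several labels per text.
single_label = [
    {"ANGRY": 1.0, "CONFUSED": 0.0, "HAPPY": 0.0},
    {"ANGRY": 0.0, "CONFUSED": 0.0, "HAPPY": 1.0},
]
multi_label = [
    {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0},
    {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0},
]

print(suggest_component(single_label))
print(suggest_component(multi_label))
```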

Last, I highly recommend looking at some of the spaCy project templates. There are several for textcat like:

FYI these typically cover more of spaCy than Prodigy -- however, a few do include the process of loading .jsonl into Prodigy. Although it's for ner, there's also a helpful template on Prodigy-spaCy project integration:

Thanks again for your question! I can understand it's sometimes tough to navigate through all of the resources, so I wouldn't be surprised if others have the same question. Let me know if you have any follow-up questions!

Hi @ryanwesslen,

Many thanks for the quick response and apologies for such newbie questions! So I am trying to train in either spaCy or Prodigy, whichever is easier. I have data that I want to label exclusively, and I have 17 labels. The spaCy data format you gave: is that what is fed into the "convert" CLI function to create the .spacy format? I see a different format on the site which has the text field labeled as "text". I have my text and label data ready after one-hot encoding; I just need to get it into the right format to run through spaCy to train. If Prodigy is easier then I can use that too. I ran a test with some labeled data I created by manually labeling my data in Prodigy, but this threw an error using the "--exclusive" flag when using "textcat" and said I needed to use "textcat-multilabel" instead. I ran that and tested the model, which output all the labels with their corresponding scores. Is that the correct use for exclusive labelling with more than 2 labels?

Many thanks!

Hi @n8te!

Thanks for your questions!

Yes - for training in spaCy. You're not expected to do this in Prodigy (it's done for you).

Did you see this on Prodigy data format? That's because Prodigy's formatting is slightly different.

Yes - Prodigy is easier. There are a few different ways to convert, but can you get your data into a format like this:

{"text": "How can I get chewy chocolate chip cookies?", "label": "baking"}
{"text": "I want to make cake.", "label": "baking"}
{"text": "Change the order to pancakes.", "label": "substitutions"}
{"text": "Please substitute in bananas.", "label": "substitutions"}
{"text": "Where is the bathroom?", "label": "OTHER"}
{"text": "What's the price of the flowers?", "label": "OTHER"}

This is more similar to what Prodigy produces, hence what prodigy train will accept.
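
If your data is currently a CSV with one-hot encoded label columns, a conversion along these lines could produce that format. This is a sketch under assumptions: the column layout (a "text" column plus one 0/1 column per label), the sample rows and the function name are all hypothetical, so adjust them to your own file.

```python
import csv
import io
import json

# Hypothetical CSV: a "text" column plus one 0/1 column per label.
CSV_DATA = """text,baking,substitutions,OTHER
I want to make cake.,1,0,0
Please substitute in bananas.,0,1,0
Where is the bathroom?,0,0,1
"""


def onehot_rows_to_jsonl(csv_file, text_column="text"):
    """Yield one JSON line per row, using the single positive column as "label"."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        text = row.pop(text_column)
        positive = [name for name, value in row.items() if float(value) == 1.0]
        if len(positive) != 1:
            # Mutually exclusive data must have exactly one label per text.
            raise ValueError(f"Expected exactly one label for {text!r}, got {positive}")
        yield json.dumps({"text": text, "label": positive[0]})


lines = list(onehot_rows_to_jsonl(io.StringIO(CSV_DATA)))
print("\n".join(lines))
# To save the result, write the lines to a file, one JSON object per line:
# with open("data.jsonl", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
```

The check for exactly one positive column catches rows that would not fit mutually exclusive training before they reach the database.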

If the data above was in a file named data.jsonl, you could load it into the Prodigy database with:

python -m prodigy db-in mydata data.jsonl
✔ Created dataset 'mydata' in database SQLite
✔ Imported 6 annotations to 'mydata' (session 2022-08-26_16-02-11) in
database SQLite
Found and keeping existing "answer" in 0 examples

(FYI Ignore that "Found and keeping existing...", see this thread. The previous output confirms you loaded 6 annotations.)

You're almost there!

You should use textcat, not textcat-multilabel, because you want mutually exclusive labels. Here's how the prodigy train docs explain the difference:

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| --textcat, -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
| --textcat-multilabel, -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |

You can then run prodigy train:

python -m prodigy train output_dir --textcat mydata

textcat doesn't use --exclusive; that was the problem. This was changed in v1.11.0. When looking at Prodigy Support posts, definitely check the date. We try to keep posts current but can't update everything.

After training, you can check your model by running it on some text. If the scores sum to 1, you have trained with mutually exclusive labels. If they sum to more than 1, you have non-mutually exclusive labels.

import spacy

# Load the best checkpoint written by prodigy train
nlp = spacy.load("output_dir/model-best")
doc = nlp("I want cookies.")
print(doc.cats)
# {'baking': 0.8704647421836853, 'substitutions': 0.09000309556722641, 'OTHER': 0.03953210636973381}
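
The sum check can be sketched in a few lines. The first dictionary reuses the scores shown above; the second is a made-up multilabel-style output, and the function name and tolerance are assumptions. A multilabel model could sum close to 1 by coincidence, so it's worth checking several texts.

```python
import math


def looks_exclusive(cats, tolerance=1e-4):
    """True if the scores sum to roughly 1, as expected for textcat."""
    return math.isclose(sum(cats.values()), 1.0, abs_tol=tolerance)


# Scores from the example output above.
exclusive_scores = {
    "baking": 0.8704647421836853,
    "substitutions": 0.09000309556722641,
    "OTHER": 0.03953210636973381,
}
# Hypothetical scores, as a multilabel model might produce.
multilabel_scores = {"baking": 0.9, "substitutions": 0.8, "OTHER": 0.1}

print(looks_exclusive(exclusive_scores))
print(looks_exclusive(multilabel_scores))
```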

Hope this helps!

@ryanwesslen

That looks like it's done it. I was getting lost with all the different data formats and examples. I've run your example, trained in Prodigy, and also exported to .spacy and trained there too. I'll get my data into this format and get on with it. THANK YOU VERY MUCH! Huge help!

One last question: I assume I can just pick the output with the highest score as the single label for my needs, or is there a way to force the model to output only one label, the highest scoring?

Again, many thanks for your help.

Yes, the simplest would be to take the highest score as your prediction.
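
As a minimal sketch, picking the top label from a doc.cats-style dictionary needs only the built-in max. The function name is hypothetical and the scores are the ones from the earlier example.

```python
def top_label(cats):
    """Return the highest-scoring label and its score."""
    label = max(cats, key=cats.get)
    return label, cats[label]


# Scores in the shape of doc.cats, taken from the earlier example.
scores = {
    "baking": 0.8704647421836853,
    "substitutions": 0.09000309556722641,
    "OTHER": 0.03953210636973381,
}

label, score = top_label(scores)
print(label, score)
```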

However, in most applications you may want to adjust the threshold depending on your precision and recall goals. You can likely find related posts with more details, or I can provide more suggestions next week. Let me know if you're interested!
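
One possible way to apply a threshold, sketched here as an assumption rather than a built-in spaCy or Prodigy feature, is to accept the top label only when its score clears a cutoff and otherwise return no prediction. The function name, the default of 0.5 and the sample scores are hypothetical.

```python
def predict_or_abstain(cats, threshold=0.5):
    """Return the top label if its score reaches the threshold, else None."""
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None


# Hypothetical scores: one confident prediction and one uncertain one.
confident = {"baking": 0.87, "substitutions": 0.09, "OTHER": 0.04}
uncertain = {"baking": 0.40, "substitutions": 0.35, "OTHER": 0.25}

print(predict_or_abstain(confident))
print(predict_or_abstain(uncertain))
print(predict_or_abstain(uncertain, threshold=0.3))
```

Raising the threshold generally means fewer but more reliable predictions, while lowering it labels more texts at the cost of more mistakes.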

@ryanwesslen

OK thanks, I'll look into this. I ran a model and have a question on the output. Some of my scores are over 1. My understanding is that the score is a percentage, so getting 0.98 would be 98%, but I have some labels that come back as 7.something or 3.something. Is this expected?

Hi @n8te!

From spaCy documentation:

For textcat (exclusive categories), the scores will sum to 1, while for textcat_multilabel there is no particular guarantee about their sum.

If your model scores (predictions) sum above 1, then you may have trained a textcat_multilabel model (non-mutually exclusive categories), not a textcat (mutually exclusive categories).

Can you double-check how you trained your model?

You can do so by looking at your model's meta.json and config.cfg files in either output_dir/model-best or output_dir/model-last.
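
A quick way to inspect this is to read the "pipeline" list in meta.json. The sketch below assumes the components kept their default names ("textcat" or "textcat_multilabel"); the function name is hypothetical, and the demo writes a stand-in meta.json to a temporary folder rather than reading a real trained model.

```python
import json
import tempfile
from pathlib import Path


def textcat_components(model_dir):
    """Return the text classification components listed in meta.json."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Assumption: components use the default names, which start with "textcat".
    return [name for name in meta.get("pipeline", []) if name.startswith("textcat")]


# Demo with a stand-in meta.json (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {"lang": "en", "pipeline": ["textcat_multilabel"]}
    (Path(tmp) / "meta.json").write_text(json.dumps(stand_in), encoding="utf-8")
    found = textcat_components(tmp)

print(found)
# For a real model: textcat_components("output_dir/model-best")
```

If the result is ["textcat_multilabel"], the model was trained with non-exclusive categories.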