training data format for multiclass textcat

ryanwesslen · August 26, 2022, 8:20pm

Thanks for your questions!

Yes - for training in spaCy. You're not expected to do this in Prodigy (it's done for you).

Did you see this on Prodigy data format? That's because Prodigy's formatting is slightly different.

Yes - Prodigy is easier. There are a few different ways to convert, but can you get your data into a format like this:

{"text": "How can I get chewy chocolate chip cookies?", "label": "baking"}
{"text": "I want to make cake.", "label": "baking"}
{"text": "Change the order to pancakes.", "label": "substitutions"}
{"text": "Please substitute in bananas.", "label": "substitutions"}
{"text": "Where is the bathroom?", "label": "OTHER"}
{"text": "What's the price of the flowers?", "label": "OTHER"}

This is closer to what Prodigy itself produces, and therefore to what prodigy train will accept.
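If your data currently lives in Python as (text, label) pairs, here is a minimal sketch of one way to get it into the JSONL format above. The `examples` list and the `to_jsonl` helper are made up for illustration; only the "text" and "label" keys come from the format shown above.

```python
import json

# Hypothetical raw data: (text, label) pairs, exactly one label per text.
examples = [
    ("How can I get chewy chocolate chip cookies?", "baking"),
    ("Change the order to pancakes.", "substitutions"),
    ("Where is the bathroom?", "OTHER"),
]


def to_jsonl(pairs):
    """Serialize (text, label) pairs as one JSON object per line."""
    lines = [
        json.dumps({"text": text, "label": label}, ensure_ascii=False)
        for text, label in pairs
    ]
    return "\n".join(lines) + "\n"


jsonl = to_jsonl(examples)
print(jsonl, end="")

# To save it for db-in, write the string to a file, for example:
# with open("data.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl)
```

Using json.dumps rather than string formatting means quotes and apostrophes in the text are escaped correctly.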

If the data above is in a file named data.jsonl, you can load it into the Prodigy database with:

python -m prodigy db-in mydata data.jsonl
✔ Created dataset 'mydata' in database SQLite
✔ Imported 6 annotations to 'mydata' (session 2022-08-26_16-02-11) in database SQLite
Found and keeping existing "answer" in 0 examples

(FYI: ignore the "Found and keeping existing..." line, see this thread. The output before it confirms that you loaded 6 annotations.)

You're almost there!

You should use textcat, not textcat-multilabel, because you want mutually exclusive labels. Here are the prodigy train docs, which explain the difference:

Argument	Type	Description	Default
--textcat, -tc	str	One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets.	None
--textcat-multilabel, -tcm	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets.	None

You can then run prodigy train:

python -m prodigy train output_dir --textcat mydata

textcat doesn't use --exclusive; that was the problem. This was changed in v1.11.0. When looking at Prodigy Support posts, definitely check the date. We try to update old posts but can't update everything.

After you've trained your model, you can check the result by loading it and running it on some text. If the label scores sum to 1, you have trained for mutually exclusive labels. If they don't sum to 1 (for example, they sum above 1), you have non-mutually exclusive labels.

import spacy
nlp = spacy.load("output_dir/model-best")
doc = nlp("I want cookies.")
doc.cats
# {'baking': 0.8704647421836853, 'substitutions': 0.09000309556722641, 'OTHER': 0.03953210636973381}
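To make that check explicit, here is a small sketch that sums the scores and compares the total to 1 within a tolerance. The `looks_exclusive` helper and its tolerance are made up for illustration, and the scores are copied from the output above so it runs without a trained model. In practice you would pass in doc.cats.

```python
import math


def looks_exclusive(cats, tol=1e-4):
    """Return True if the category scores sum to roughly 1."""
    return math.isclose(sum(cats.values()), 1.0, abs_tol=tol)


# Scores copied from the example output above (stand-in for doc.cats).
cats = {
    "baking": 0.8704647421836853,
    "substitutions": 0.09000309556722641,
    "OTHER": 0.03953210636973381,
}

print(sum(cats.values()))
print(looks_exclusive(cats))
```

A non-exclusive model can land near 1 on a single text by coincidence, so it's worth running this on a handful of different texts.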

Hope this helps!
