training data format for multiclass textcat

Hi @n8te!

Thanks for your questions!

Yes - for training in spaCy. You're not expected to do this in Prodigy (it's done for you).

Did you see this on Prodigy data format? That's because Prodigy's formatting is slightly different.

Yes - Prodigy is easier. There are a few different ways to convert your data, but can you get it into a format like this:

{"text": "How can I get chewy chocolate chip cookies?", "label": "baking"}
{"text": "I want to make cake.", "label": "baking"}
{"text": "Change the order to pancakes.", "label": "substitutions"}
{"text": "Please substitute in bananas.", "label": "substitutions"}
{"text": "Where is the bathroom?", "label": "OTHER"}
{"text": "What's the price of the flowers?", "label": "OTHER"}

This is closer to what Prodigy produces, and therefore to what prodigy train will accept.
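If your examples currently live in Python as (text, label) pairs, a minimal sketch of the conversion could look like the code below. The `examples` list, the `to_jsonl` and `write_jsonl` helpers, and the `data.jsonl` filename are only illustrative assumptions; adapt them to however your data is actually stored.

```python
import json

# Hypothetical input: (text, label) pairs, one label per text.
examples = [
    ("How can I get chewy chocolate chip cookies?", "baking"),
    ("Change the order to pancakes.", "substitutions"),
    ("Where is the bathroom?", "OTHER"),
]


def to_jsonl(pairs):
    """Serialize (text, label) pairs as one JSON object per line."""
    lines = [
        json.dumps({"text": text, "label": label}, ensure_ascii=False)
        for text, label in pairs
    ]
    return "\n".join(lines) + "\n"


def write_jsonl(path, pairs):
    """Write the pairs to a JSONL file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(pairs))


if __name__ == "__main__":
    # The filename is an assumption; use whatever name you pass to db-in.
    write_jsonl("data.jsonl", examples)
```

Each output line is a standalone JSON object with a "text" and a "label" key, matching the sample above.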

If the data above is in a file named data.jsonl, you can load it into the Prodigy database with:

python -m prodigy db-in mydata data.jsonl
✔ Created dataset 'mydata' in database SQLite
✔ Imported 6 annotations to 'mydata' (session 2022-08-26_16-02-11) in database SQLite
Found and keeping existing "answer" in 0 examples

(FYI: you can ignore the "Found and keeping existing..." line; see this thread. The preceding output confirms that 6 annotations were loaded.)
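If you'd like to double-check the file against the import count, a small sketch like the one below counts the records and labels in JSONL lines. It uses only the standard library; `summarize_jsonl` is a hypothetical helper, not part of Prodigy, and the sample lines stand in for the contents of your file.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Return (number of records, Counter of labels) for JSONL lines."""
    labels = Counter()
    total = 0
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "text" not in record or "label" not in record:
            raise ValueError(f"line {line_number} is missing 'text' or 'label'")
        labels[record["label"]] += 1
        total += 1
    return total, labels


# Sample lines standing in for the contents of data.jsonl.
sample = [
    '{"text": "I want to make cake.", "label": "baking"}',
    '{"text": "Please substitute in bananas.", "label": "substitutions"}',
    '{"text": "Where is the bathroom?", "label": "OTHER"}',
]
total, labels = summarize_jsonl(sample)
print(total, dict(labels))
```

For a real file, you could pass an open file handle instead of the list, since iterating over a file yields its lines. The total should match the number of annotations that db-in reports.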

You're almost there!

You should use textcat, not textcat-multilabel, because you want mutually exclusive labels. Here's the relevant part of the prodigy train docs, which explains the difference:

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| --textcat, -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
| --textcat-multilabel, -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |

You can then run prodigy train:

python -m prodigy train output_dir --textcat mydata

The problem was --exclusive: textcat no longer uses it. This was changed in v1.11.0. When looking at Prodigy Support posts, definitely check the date. We try to keep older posts up to date but can't update everything.

After you've trained your model, you can check which kind you have by running it on some text and looking at the scores. If the scores across labels sum to 1 (up to floating-point rounding), you have trained a mutually exclusive classifier. If they sum to more than 1, you have a non-mutually exclusive one.

import spacy
nlp = spacy.load("output_dir/model-best")
doc = nlp("I want cookies.")
print(doc.cats)  # category scores predicted for this text
# {'baking': 0.8704647421836853, 'substitutions': 0.09000309556722641, 'OTHER': 0.03953210636973381}
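To make that check explicit, here's a rough sketch that sums the scores and compares the total to 1 within a small tolerance. `looks_mutually_exclusive` is a hypothetical helper and the tolerance is an arbitrary choice. The first dictionary is the output shown above; the second uses made-up scores to illustrate the non-exclusive case.

```python
import math


def looks_mutually_exclusive(cats, tolerance=1e-4):
    """Return True if the category scores sum to 1 within the tolerance."""
    return math.isclose(sum(cats.values()), 1.0, abs_tol=tolerance)


# Scores copied from the doc.cats output above.
exclusive_cats = {
    "baking": 0.8704647421836853,
    "substitutions": 0.09000309556722641,
    "OTHER": 0.03953210636973381,
}
print(looks_mutually_exclusive(exclusive_cats))  # True

# Made-up scores for illustration only: each label is scored independently.
multilabel_cats = {"baking": 0.9, "substitutions": 0.7, "OTHER": 0.1}
print(looks_mutually_exclusive(multilabel_cats))  # False
```

In practice you'd pass doc.cats directly. It's worth checking a few different texts, since a single example can sum close to 1 by coincidence.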

Hope this helps!