Multilabel JSONL format & active learning

ryanwesslen · August 5, 2022, 1:49pm

My colleague @Jette16 reminded me that there's an alternative data format for multilabel annotations. Each example lists every possible tag under "options" and the accepted (selected) labels under "accept":

{"options": [{"id": "OTHER"}, {"id": "baking"}, {"id":"bread"}, {"id":"chicken"}, {"id":"eggs"}, {"id":"equipment"}, {"id": "food-safety"},{"id":"meat"}, {"id":"sauce"}, {"id":"storage-method"}, {"id":"substitutions"}], "accept":["OTHER","baking"], "text": "How can I get chewy chocolate chip cookies?\n<p>My chocolate chips cookies are always too crisp. How can I get chewy cookies, like those of Starbucks?</p>\n<hr/>\n<p>Thank you to everyone who has answered. So far the tip that had the biggest impact was to chill and rest the dough, however I also increased the brown sugar ratio and increased a bit the butter. Also adding maple syrup helped. </p>\n"}
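Before importing, it can help to sanity-check each line of the file. The snippet below is only a rough sketch using the standard library: `check_example` is a hypothetical helper, not part of Prodigy, and the checks simply mirror the structure of the example above ("text", "options" with an "id" each, and "accept" values that appear in "options").

```python
import json
from typing import List


def check_example(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    task = json.loads(line)
    problems = []
    if "text" not in task:
        problems.append('missing "text"')
    # Collect the IDs of all options offered for this example.
    option_ids = [opt.get("id") for opt in task.get("options", [])]
    if not option_ids:
        problems.append('missing or empty "options"')
    # Every accepted label should be one of the offered options.
    for label in task.get("accept", []):
        if label not in option_ids:
            problems.append(f'accepted label {label!r} is not in "options"')
    return problems


if __name__ == "__main__":
    # Assumes the example above was saved as test.jsonl.
    try:
        with open("test.jsonl", encoding="utf-8") as f:
            for number, line in enumerate(f, start=1):
                if line.strip():
                    for problem in check_example(line):
                        print(f"line {number}: {problem}")
    except FileNotFoundError:
        print("test.jsonl not found")
```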

To test this, could you save the example above as test.jsonl and run:

prodigy db-in import_data ./test.jsonl --rehash
prodigy train ./model --textcat-multilabel import_data --eval-split 0.2 --base-model blank:en

This will give us a reproducible test case so we can rule out other issues. If it works, try converting your data to this format.
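For the conversion step, something along these lines might work. It's a minimal sketch: I'm assuming your current records look like `{"text": ..., "labels": [...]}`, which may not match your data, and `to_multilabel_task` and `write_jsonl` are made-up helper names rather than Prodigy functions.

```python
import json


def to_multilabel_task(text, labels, all_labels):
    """Build one task with every label under "options" and the selected ones under "accept"."""
    unknown = set(labels) - set(all_labels)
    if unknown:
        raise ValueError(f"labels not in the label set: {sorted(unknown)}")
    return {
        "text": text,
        "options": [{"id": label} for label in all_labels],
        # Keep the accepted labels in the same order as the full label set.
        "accept": [label for label in all_labels if label in labels],
    }


def write_jsonl(records, all_labels, path):
    """Write one converted task per line. `records` uses the assumed text/labels shape."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            task = to_multilabel_task(record["text"], record["labels"], all_labels)
            f.write(json.dumps(task) + "\n")


if __name__ == "__main__":
    all_labels = ["OTHER", "baking", "bread"]
    records = [{"text": "How can I get chewy chocolate chip cookies?", "labels": ["baking"]}]
    write_jsonl(records, all_labels, "converted.jsonl")
```

The resulting file can then be imported with the same db-in command as above.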
