Multilabel JSONL format & active learning

suryaiitkgp · August 4, 2022, 5:27am

I am trying to import a multilabel dataset using the db-in recipe. The input data format is:

{"text":"lorem ipsum dolor sit amet consectetuer adipiscing elit", "label": "LABEL_0"}
{"text":"lorem ipsum dolor sit amet consectetuer adipiscing elit", "label": ["LABEL_0","LABEL_1"]}
{"text":"lorem ipsum dolor sit amet consectetuer adipiscing elit", "label": ["LABEL_0","LABEL_1","LABEL_2"]}

Is this correct?

The train recipe is not treating it as multilabel. I am using this command:
prodigy train ./model --textcat-multilabel dataset_name --eval-split 0.2 --base-model blank:en

Also, is an active learning recipe available for multilabel datasets?

ryanwesslen · August 4, 2022, 1:52pm

hi @suryaiitkgp!

Thanks for your question.

Try a format like this:

{"cats": {"OTHER": 1.0, "baking": 1.0, "bread": 0.0, "chicken": 0.0, "eggs": 0.0, "equipment": 0.0, "food-safety": 0.0, "meat": 0.0, "sauce": 0.0, "storage-method": 0.0, "substitutions": 0.0}, "meta": {"id": "1"}, "text": "How can I get chewy chocolate chip cookies?\n<p>My chocolate chips cookies are always too crisp. How can I get chewy cookies, like those of Starbucks?</p>\n<hr/>\n<p>Thank you to everyone who has answered. So far the tip that had the biggest impact was to chill and rest the dough, however I also increased the brown sugar ratio and increased a bit the butter. Also adding maple syrup helped. </p>\n"}

This is consistent with spaCy's tests, and we have a multilabel project template that uses this format (look in its assets folder to see how the data is set up). That project trains with spaCy directly, but that's really what prodigy train is doing underneath.

You should now be able to run prodigy train and get your initial model in the ./model folder.

As for active learning: yes. Once you have your first model, you can run the textcat.teach recipe and specify that model (./model) as the input model. Be sure to choose which labels you want to use.

I'd also recommend reading through the Text Classification documentation, which has more details on active learning.

Hope this helps!

suryaiitkgp · August 4, 2022, 6:15pm

Thank you @ryanwesslen for your response. However, after annotating my dataset in the format you shared, I am getting the following error while training.

I used this command for model training:

prodigy train ./model --textcat-multilabel demo_dataset --eval-split 0.2 --base-model blank:en

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-08-04 18:10:25,215] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 302 | Evaluation: 74 (20% split)
✘ Invalid data for component 'textcat'

cats str type expected

{'cats': {'LABEL_0': 1.0, 'LABEL_1': 0.0, 'LABEL_2': 0.0, 'LABEL_3': 0.0, 'LABEL_4': 0.0, 'LABEL_5': 0.0}, 'text': 'I wish you a very happy and prosperous Holi. Its a festival of colours we celebrate in GPE and hence taking a break. If its very urgent, and can not wait, then ring me on: + 91 XXXXXXXXX.', '_input_hash': -889657000, '_task_hash': -1196020483, 'answer': 'accept'}

A sample of my annotation format:

{"cats":{"LABEL_0":1.0,"LABEL_1":0.0,"LABEL_2":0.0,"LABEL_3":0.0,"LABEL_4":0.0,"LABEL_5":0.0}, "text": "I am OOO."}

ryanwesslen · August 5, 2022, 1:49pm

hi @suryaiitkgp !

My colleague @Jette16 reminded me that there's an alternative data format for multilabel. It lists every label under "options" and puts the accepted (selected) labels under "accept":

{"options": [{"id": "OTHER"}, {"id": "baking"}, {"id":"bread"}, {"id":"chicken"}, {"id":"eggs"}, {"id":"equipment"}, {"id": "food-safety"},{"id":"meat"}, {"id":"sauce"}, {"id":"storage-method"}, {"id":"substitutions"}], "accept":["OTHER","baking"], "text": "How can I get chewy chocolate chip cookies?\n<p>My chocolate chips cookies are always too crisp. How can I get chewy cookies, like those of Starbucks?</p>\n<hr/>\n<p>Thank you to everyone who has answered. So far the tip that had the biggest impact was to chill and rest the dough, however I also increased the brown sugar ratio and increased a bit the butter. Also adding maple syrup helped. </p>\n"}
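If your labels are currently stored as a plain string or list (as in the first post), a record in the shape above can be built with a few lines of Python. This is only a rough sketch, not an official Prodigy utility: the function name make_choice_task and the LABELS list are hypothetical, and it assumes you know the full label set up front.

```python
import json


def make_choice_task(text, accepted, all_labels):
    """Build one record with "options" (every label) and "accept" (selected labels)."""
    # Allow a single label given as a bare string, e.g. "LABEL_0".
    if isinstance(accepted, str):
        accepted = [accepted]
    # Fail early on labels that are not part of the known label set.
    unknown = set(accepted) - set(all_labels)
    if unknown:
        raise ValueError(f"labels not in label set: {sorted(unknown)}")
    return {
        "text": text,
        "options": [{"id": label} for label in all_labels],
        # Keep the accepted labels in the same order as the label set.
        "accept": [label for label in all_labels if label in accepted],
    }


# Hypothetical label set, for illustration only.
LABELS = ["LABEL_0", "LABEL_1", "LABEL_2"]

task = make_choice_task("lorem ipsum dolor sit amet", ["LABEL_0", "LABEL_1"], LABELS)
print(json.dumps(task))
```

Each call produces one JSON object; writing one object per line gives you a JSONL file.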

To test this out, can you save the example above as test.jsonl and run:

prodigy db-in import_data ./test.jsonl --rehash
prodigy train ./model --textcat-multilabel import_data --eval-split 0.2 --base-model blank:en

This will give us a reproducible use case to ensure there aren't other issues. If this works, then try to convert your data to this format.
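For that conversion, something along these lines might work as a starting point. It is a sketch under a few assumptions: your existing file has one JSON object per line with "text" and "cats" keys (like the sample you posted), a label counts as selected when its score is at least 0.5 (my choice of threshold, adjust as needed), and the helper names cats_to_choice and convert_jsonl are made up for this example.

```python
import json


def cats_to_choice(record, threshold=0.5):
    """Turn a {"text", "cats"} record into a {"text", "options", "accept"} record."""
    cats = record["cats"]
    return {
        "text": record["text"],
        # Every label in "cats" becomes an option, in the original order.
        "options": [{"id": label} for label in cats],
        # Labels scoring at or above the (assumed) threshold count as selected.
        "accept": [label for label, score in cats.items() if score >= threshold],
    }


def convert_jsonl(src_path, dst_path):
    """Convert a JSONL file line by line, skipping blank lines."""
    with open(src_path, encoding="utf8") as src, open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            dst.write(json.dumps(cats_to_choice(json.loads(line))) + "\n")


example = {"cats": {"LABEL_0": 1.0, "LABEL_1": 0.0}, "text": "I am OOO."}
print(json.dumps(cats_to_choice(example)))
```

You could then call convert_jsonl on your exported file and import the result with db-in as above.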

suryaiitkgp · August 9, 2022, 5:10pm

Thank you, @ryanwesslen.
It is working for our use case.
