Hi @nix411!
Sorry for the delay. It appears that `prodigy train --textcat-multilabel` will automatically use all labels as possibilities, not just those that are provided/relevant.

I just spent some time on a small experiment to understand this. I don't have a perfect answer, but I've found some interesting points that can perhaps help us move forward. To really understand this I'd need to dig into the spaCy code for how it handles `textcat_multilabel`. It may also be worthwhile to find out whether this behavior can be modified through a spaCy config file. I'll need to discuss with colleagues to confirm, but hopefully this gives some idea. Also, feel free to replicate what I did to see if you can confirm.
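In case it helps while digging into the config angle: the component section of an auto-generated spaCy v3 config looks roughly like the sketch below (the values here are illustrative defaults, not taken from this run), so any override would presumably live under `[components.textcat_multilabel]`.

```ini
# Sketch of the relevant section of a spaCy v3 config (illustrative values)
[components.textcat_multilabel]
factory = "textcat_multilabel"
threshold = 0.5
```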
I'm curious: do you have some context on your use case? I'm wondering what situation would require these capabilities.
## First try
First, I tried to use your example (only adding an `"answer"` key) and then added examples where all the categories are filled in (either `0.0` or `1.0`):
```json
{"id":1,"text":"mention something","cats":{"A":0.0},"answer":"accept"}
{"id":2,"text":"talking about a","cats":{"A":1.0},"answer":"accept"}
{"id":3,"text":"talking about a and not b","cats":{"A":1.0,"B":0.0},"answer":"accept"}
{"id":4,"text":"mention both a and b","cats":{"A":1.0,"B":1.0},"answer":"accept"}
{"id":5,"text":"just b now","cats":{"B":1.0},"answer":"accept"}
```
Unfortunately, I found this isn't valid JSON, so I decided on an alternative route.
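For what it's worth, here's a quick way to find which lines of a `.jsonl` file fail to parse (a generic sketch, not Prodigy's own validation):

```python
import json

def check_jsonl(lines):
    """Return (line_number, error) pairs for lines that are not valid JSON."""
    errors = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            errors.append((i, str(err)))
    return errors

lines = [
    '{"id": 1, "text": "mention something", "cats": {"A": 0.0}, "answer": "accept"}',
    '{"id": 2, "text": "talking about a", "cats": {"A": 1.0},}',  # trailing comma: invalid
]
print(check_jsonl(lines))
```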
## An alternative labeling scheme for `textcat_multilabel`
While using `cats` is the typical way to annotate categories, Prodigy can handle an alternative `.jsonl` scheme for `textcat_multilabel` labels that uses the `"options"` and `"accept"` keys to provide the candidate categories and which of them were accepted. By default, this is how Prodigy will export `textcat_multilabel` annotations (e.g., from `textcat.manual`, which defaults to non-mutually-exclusive categories / `textcat_multilabel`). For example, see #3 in the Prodigy documentation.
What I like about this format is that it at least gives us a way to pass the information you want: both which labels were selected (`accept`) and which were offered as options (`options`). The problem with the `cats`-only format is that we can only show what was selected.
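To move existing `cats`-style records over to that scheme, a small conversion could look like the sketch below (`cats_to_options` is a hypothetical helper for illustration, not a Prodigy API):

```python
def cats_to_options(record):
    """Convert a `cats`-style record to the options/accept scheme.

    Every key in `cats` becomes an option; keys with value 1.0 are accepted.
    (Hypothetical helper for illustration, not part of Prodigy.)
    """
    cats = record.get("cats", {})
    out = {k: v for k, v in record.items() if k != "cats"}
    out["options"] = [{"id": label} for label in cats]
    out["accept"] = [label for label, score in cats.items() if score == 1.0]
    return out

rec = {"id": 3, "text": "talking about a and not b",
       "cats": {"A": 1.0, "B": 0.0}, "answer": "accept"}
print(cats_to_options(rec))
```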
## Experiment
Taking your example (where the options are only A for records 1 and 2, and only B for record 5), we can rewrite it as:
```json
# multilabel.jsonl
{"id":1,"text":"mention something","options":[{"id":"A"}],"accept":[],"answer":"accept"}
{"id":2,"text":"talking about a","options":[{"id":"A"}],"accept":["A"],"answer":"accept"}
{"id":3,"text":"talking about a and not b","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":4,"text":"mention both a and b","options":[{"id":"A"},{"id":"B"}],"accept":["A","B"],"answer":"accept"}
{"id":5,"text":"just b now","options":[{"id":"B"}],"accept":["B"],"answer":"accept"}
```
As a comparison, I also created a second `.jsonl` where the options are the same for all five records:
```json
# multilabel2.jsonl
{"id":1,"text":"mention something","options":[{"id":"A"},{"id":"B"}],"accept":[],"answer":"accept"}
{"id":2,"text":"talking about a","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":3,"text":"talking about a and not b","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":4,"text":"mention both a and b","options":[{"id":"A"},{"id":"B"}],"accept":["A","B"],"answer":"accept"}
{"id":5,"text":"just b now","options":[{"id":"A"},{"id":"B"}],"accept":["B"],"answer":"accept"}
```
After loading both files as Prodigy datasets via `db-in`, I ran `prodigy train` on both (see below). It appears `prodigy train` produces exactly the same model for both, which leads me to believe `options` isn't used in model training. What is important is that if you remove `options` entirely, that record is skipped. I therefore suspect `options` is currently only used as a data validation check, not in model training. I'm going to make a note to dig into the code to confirm.
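Going the other way makes the ambiguity concrete: if unselected options counted as explicit negatives, the two datasets above would imply different `cats` for records 1, 2, and 5. The sketch below shows that derivation under that assumption (which is an assumption, not confirmed Prodigy/spaCy behavior; the identical training runs below suggest the distinction never reaches the model):

```python
def options_to_cats(record):
    """Derive a `cats` dict from options/accept, treating unselected
    options as explicit 0.0 negatives (an assumption for illustration,
    not confirmed Prodigy behavior)."""
    accepted = set(record.get("accept", []))
    return {opt["id"]: (1.0 if opt["id"] in accepted else 0.0)
            for opt in record.get("options", [])}

rec_a = {"id": 5, "text": "just b now",
         "options": [{"id": "B"}], "accept": ["B"]}          # from multilabel.jsonl
rec_b = {"id": 5, "text": "just b now",
         "options": [{"id": "A"}, {"id": "B"}], "accept": ["B"]}  # from multilabel2.jsonl
print(options_to_cats(rec_a))
print(options_to_cats(rec_b))
```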
```
$ prodigy train --textcat-multilabel multilabel --label-stats
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-08 12:03:19,671] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
[2022-06-08 12:03:19,682] [INFO] Pipeline: ['textcat_multilabel']
[2022-06-08 12:03:19,684] [INFO] Created vocabulary
[2022-06-08 12:03:19,685] [INFO] Finished initializing nlp object
[2022-06-08 12:03:19,694] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
   E     #   LOSS TEXTC...  CATS_SCORE  SCORE
 ---  ----  -------------  ----------  ------
   0     0           0.03        0.00    0.00
 200   200           4.03        0.00    0.00
 400   400           1.65        0.00    0.00
 600   600           0.82        0.00    0.00
 800   800           0.48        0.00    0.00
1000  1000           0.31        0.00    0.00
1200  1200           0.22        0.00    0.00
1400  1400           0.16        0.00    0.00
1600  1600           0.12        0.00    0.00

=========================== Textcat F (per label) ===========================

         P        R        F
A   100.00   100.00   100.00
B     0.00     0.00     0.00

======================== Textcat ROC AUC (per label) ========================

    ROC AUC
A      None
B      None
```
```
$ prodigy train --textcat-multilabel multilabel2 --label-stats
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-08 12:03:28,213] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
[2022-06-08 12:03:28,221] [INFO] Pipeline: ['textcat_multilabel']
[2022-06-08 12:03:28,223] [INFO] Created vocabulary
[2022-06-08 12:03:28,224] [INFO] Finished initializing nlp object
[2022-06-08 12:03:28,227] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
   E     #   LOSS TEXTC...  CATS_SCORE  SCORE
 ---  ----  -------------  ----------  ------
   0     0           0.03        0.00    0.00
 200   200           4.03        0.00    0.00
 400   400           1.65        0.00    0.00
 600   600           0.82        0.00    0.00
 800   800           0.48        0.00    0.00
1000  1000           0.31        0.00    0.00
1200  1200           0.22        0.00    0.00
1400  1400           0.16        0.00    0.00
1600  1600           0.12        0.00    0.00

=========================== Textcat F (per label) ===========================

         P        R        F
A   100.00   100.00   100.00
B     0.00     0.00     0.00

======================== Textcat ROC AUC (per label) ========================

    ROC AUC
A      None
B      None
```