textcat_multilabel with only some labels annotated for some examples

Imagine I have two labels, A and B. In my workflow I use binary annotation, i.e. given an example and a label, I either accept or reject. This way I end up with examples that are only partly annotated: for some I know whether each label is accepted or rejected, while for others I only know the answer for either A or B.

How is that reflected in the data set to be used for training? I imagine a data set looking like this:

{"id": 1, "text": "....", "cats": {"A": 0.0}},
{"id": 2, "text": "....", "cats": {"A": 1.0}},
{"id": 3, "text": "....", "cats": {"A": 1.0, "B": 0.0}},
{"id": 4, "text": "....", "cats": {"A": 1.0, "B": 1.0}},
{"id": 5, "text": "....", "cats": {"B": 1.0}},

which can then be converted into spaCy's binary format.
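For concreteness, here's roughly how I'd imagine that conversion. This is just a sketch based on my assumptions: a blank English pipeline, a hypothetical file name, and labels missing from "cats" simply being left unset on the Doc.

# convert_partial.py (hypothetical)
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: blank English pipeline
db = DocBin()
with open("partial_cats.jsonl", encoding="utf8") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        # Only the annotated labels are set; labels absent from "cats"
        # stay missing on the Doc rather than defaulting to 0.0.
        doc.cats = record["cats"]
        db.add(doc)
db.to_disk("./train.spacy")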

Does textcat_multilabel automatically use only the examples relevant to each label? E.g., ids 1 and 2 wouldn't be used when training the B label predictions?

Anyone able to answer this?

Hi @nix411!

Sorry for the delay. It appears that prodigy train --textcat-multilabel will automatically treat all labels as possibilities for every example, not just those that were provided/relevant.

I just spent some time on a small experiment to understand this. I don't have a perfect answer, but I've found some interesting points that can perhaps help us move forward. To really understand this I'd need to dig into the spaCy code for how it handles textcat_multilabel, and it may also be worth checking whether this behavior can be modified through a spaCy config file. I'll need to discuss with colleagues to confirm, but hopefully this gives some idea. Feel free to try to replicate what I did to see if you can confirm.

I'm curious - do you have some context on your use case? I'm wondering in what situation you'd need these capabilities.

First try

First, I tried using your example as-is (only adding an "answer": "accept" key), with the plan of comparing it against a second version where all the categories are filled in (either 0.0 or 1.0).

{"id":1,"text":"mention something","cats":{"A":0.0},"answer":"accept"}
{"id":2,"text":"talking about a","cats":{"A":1.0},"answer":"accept"}
{"id":3,"text":"talking about a and not b","cats":{"A":1.0,"B":0.0},"answer":"accept"}
{"id":4,"text":"mention both a and b","cats":{"A":1.0,"B":1.0},"answer":"accept"}
{"id":5,"text":"just b now","cats":{"B":1.0},"answer":"accept"}

Unfortunately, I found this wasn't accepted as valid JSON, so I decided on an alternative route.

textcat_multilabel alternative labeling scheme

While cats is the typical way to annotate categories, Prodigy can handle an alternative .jsonl scheme for textcat_multilabel labels that uses the "options" and "accept" keys to record the candidate categories and which of them were accepted. By default, this is how Prodigy exports textcat_multilabel annotations (e.g., from textcat.manual, which uses non-mutually-exclusive categories, i.e. textcat_multilabel, by default). For example, see #3 in the Prodigy documentation.

What I like about this format is that it at least gives us a way to pass the information you want: both which labels were selected (accept) and which were offered as candidates (options). The problem with the cats format alone is that we can only express what was selected.
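To make that concrete, here's a small sketch of how an options/accept record could be mapped back to a cats dict while keeping unasked labels missing. to_cats is my own hypothetical helper, not a Prodigy function:

def to_cats(record):
    # Labels that were offered as options become 0.0 or 1.0 depending on
    # whether they were accepted; labels never offered stay missing.
    accepted = set(record.get("accept", []))
    return {opt["id"]: 1.0 if opt["id"] in accepted else 0.0
            for opt in record.get("options", [])}

to_cats({"options": [{"id": "A"}], "accept": ["A"]})  # {"A": 1.0}; B stays missing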

Experiment

Taking your example (where the options are only A for ids 1 and 2, and only B for id 5), we can rewrite it as:

# multilabel.jsonl
{"id":1,"text":"mention something","options":[{"id":"A"}],"accept":[],"answer":"accept"}
{"id":2,"text":"talking about a","options":[{"id":"A"}],"accept":["A"],"answer":"accept"}
{"id":3,"text":"talking about a and not b","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":4,"text":"mention both a and b","options":[{"id":"A"},{"id":"B"}],"accept":["A","B"],"answer":"accept"}
{"id":5,"text":"just b now","options":[{"id":"B"}],"accept":["B"],"answer":"accept"}

As a comparison, I also created a second .jsonl where the options are the same for all five records:

# multilabel2.jsonl
{"id":1,"text":"mention something","options":[{"id":"A"},{"id":"B"}],"accept":[],"answer":"accept"}
{"id":2,"text":"talking about a","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":3,"text":"talking about a and not b","options":[{"id":"A"},{"id":"B"}],"accept":["A"],"answer":"accept"}
{"id":4,"text":"mention both a and b","options":[{"id":"A"},{"id":"B"}],"accept":["A","B"],"answer":"accept"}
{"id":5,"text":"just b now","options":[{"id":"A"},{"id":"B"}],"accept":["B"],"answer":"accept"}

After loading both of these files as Prodigy datasets via db-in, I ran prodigy train on each (see below). prodigy train produces exactly the same model for both, which leads me to believe the options key isn't used in training. What does matter is that if you remove options entirely, that record is skipped. I therefore suspect options is currently only used for a data validation check, not for model training. I'm going to make a note to dig into the code to confirm.

prodigy train --textcat-multilabel multilabel --label-stats
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-08 12:03:19,671] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
[2022-06-08 12:03:19,682] [INFO] Pipeline: ['textcat_multilabel']
[2022-06-08 12:03:19,684] [INFO] Created vocabulary
[2022-06-08 12:03:19,685] [INFO] Finished initializing nlp object
[2022-06-08 12:03:19,694] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  -------------  ----------  ------
  0       0           0.03        0.00    0.00
200     200           4.03        0.00    0.00
400     400           1.65        0.00    0.00
600     600           0.82        0.00    0.00
800     800           0.48        0.00    0.00
1000    1000           0.31        0.00    0.00
1200    1200           0.22        0.00    0.00
1400    1400           0.16        0.00    0.00
1600    1600           0.12        0.00    0.00

=========================== Textcat F (per label) ===========================

         P        R        F
A   100.00   100.00   100.00
B     0.00     0.00     0.00


======================== Textcat ROC AUC (per label) ========================

    ROC AUC
A      None
B      None
prodigy train --textcat-multilabel multilabel2 --label-stats
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-08 12:03:28,213] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
[2022-06-08 12:03:28,221] [INFO] Pipeline: ['textcat_multilabel']
[2022-06-08 12:03:28,223] [INFO] Created vocabulary
[2022-06-08 12:03:28,224] [INFO] Finished initializing nlp object
[2022-06-08 12:03:28,227] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (2)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  -------------  ----------  ------
  0       0           0.03        0.00    0.00
200     200           4.03        0.00    0.00
400     400           1.65        0.00    0.00
600     600           0.82        0.00    0.00
800     800           0.48        0.00    0.00
1000    1000           0.31        0.00    0.00
1200    1200           0.22        0.00    0.00
1400    1400           0.16        0.00    0.00
1600    1600           0.12        0.00    0.00

=========================== Textcat F (per label) ===========================

         P        R        F
A   100.00   100.00   100.00
B     0.00     0.00     0.00


======================== Textcat ROC AUC (per label) ========================

    ROC AUC
A      None
B      None

Hi,

Thanks for your response. I was only speaking theoretically here - I’m not seeing a specific issue.

I guess I'm kind of asking whether I can expect the same prediction score for label A whether or not I included training data for label B. Or are the predictions mutually independent?

No, you can't do that. With that format you are saying that examples 1 and 2, for instance, do not have label B. But the fact is that I don't know whether they have label B, because I've never answered that question. This is really the core of my question.

If the question is whether training would be exactly the same when considering just one of the labels instead of both of them, then the answer is "no". This is because there are shared parts in the neural architecture; the tok2vec layer, or any attention layer, would be an example of this. It's hard to say upfront how much of an impact this really makes, but it's good to be aware that spaCy has multiple model architectures that can perform the textcat_multilabel task.
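As a rough illustration of that shared-parameters point, here's a sketch assuming spaCy v3 and the default textcat_multilabel architecture:

import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
textcat.add_label("A")
textcat.add_label("B")
nlp.initialize()
# The component's model contains embedding/encoding layers shared across
# labels: an example annotated only for A still updates those weights,
# which in turn influences the scores produced for B.
print(textcat.model.name)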

That said, the data you provide is used as given: no loss is calculated for the missing values, so they have no effect on the model's weight updates.
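Conceptually, the mechanism looks like the sketch below: the gradient is masked so that unannotated labels contribute nothing. This is a simplified illustration, not spaCy's actual implementation.

import numpy as np

LABELS = ["A", "B"]

def masked_gradient(scores, cats_per_example):
    truths = np.zeros_like(scores)
    not_missing = np.zeros_like(scores)
    for i, cats in enumerate(cats_per_example):
        for j, label in enumerate(LABELS):
            if label in cats:
                truths[i, j] = cats[label]
                not_missing[i, j] = 1.0
    # Squared-error gradient; entries for unannotated labels are zeroed,
    # so missing values never push the weights in either direction.
    return (scores - truths) * not_missing

# The first example annotates only A, so its gradient for B is exactly 0.
masked_gradient(np.array([[0.9, 0.4], [0.2, 0.7]]),
                [{"A": 1.0}, {"A": 0.0, "B": 1.0}])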