I have a dataset where I've been annotating both textcat (exclusive) and ner labels simultaneously. I recently started exporting all the data with data-to-spacy to train with spaCy, and the scores I'm getting for the textcat model are very strange: high R for the textcat labels but very low P, almost as if the model were being evaluated as a multilabel model and just applying all the labels in most cases. This isn't about training NER and textcat in the same pipeline; it's about the textcat data itself. Even with a textcat-only pipeline config, I see a stark difference depending on whether I exported textcat-only training data or both ner and textcat. With textcat-only data, I get high P and R.
I'm still investigating the issue myself and have yet to fully understand what's going on. A first possible clue is that the number of exported examples differs depending on which components I export.
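For reference, the two exports below were run roughly like this (out_folder and my_dataset_name are placeholders):

prodigy data-to-spacy out_folder --ner=my_dataset_name --textcat=my_dataset_name
prodigy data-to-spacy out_folder --ner=my_dataset_name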
============================== Generating data ==============================
Components: ner, textcat
Merging training and evaluation data for 2 components
- [ner] Training: 10928 | Evaluation: 2731 (20% split)
- [textcat] Training: 10928 | Evaluation: 2731 (20% split)
Training: 12932 | Evaluation: 4854
vs.
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 10928 | Evaluation: 2731 (20% split)
Training: 10820 | Evaluation: 2727
So there's clearly not 100% overlap between which examples end up with ner annotations and which end up with textcat annotations, and possibly some duplication. The evaluation split in particular looks strangely non-overlapping. In fact, the ratio 2731/4854 = 0.56 is pretty close to the low P numbers I'm getting, so it's very likely related. I'll keep investigating, but wanted to post this now in case anyone has an idea.
Hi! It sounds like the problem might be related to this issue in spaCy and how "missing" labels are evaluated by the textcat component. The fix will be included in v3.3 of spaCy.
Thanks @ines, that does seem to be the explanation! And if so, my understanding is that this is only a problem with the evaluation, not with the training? I.e., the model update deals correctly with missing labels? A multilabel textcat model should presumably take an example without any labels to mean that none of the labels apply, while a multiclass textcat model should treat such an example as containing no information and not use it for the update.
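To make the distinction concrete, this is how I picture the two cases at the Doc level (just a sketch of the annotations, not spaCy internals; LABEL_A and LABEL_B are placeholders):

import spacy

nlp = spacy.blank("en")

# Explicit annotation: LABEL_A applies and LABEL_B explicitly does not.
doc_annotated = nlp("some annotated text")
doc_annotated.cats = {"LABEL_A": 1.0, "LABEL_B": 0.0}

# Missing annotation: nothing is known about either label. For an exclusive
# (multiclass) textcat I'd expect this example to be skipped during the
# update; for multilabel it seems reasonable to read it as "no label applies".
doc_missing = nlp("some unannotated text")
doc_missing.cats = {}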
Yes, I think that should be the case! If you want to, you could also port over the changes from this PR and hack them into your spaCy installation, since it's mostly just one file anyway, and then re-run the training and compare the results.
Hi @ines, after testing the scoring in v3.3.0 and finding that it didn't fix my issue, I decided to investigate a bit more thoroughly. There are actually two things going on here.
Firstly, when you have a single dataset containing two types of annotation (ner and textcat in my case, annotated together), you have to pass the dataset to data-to-spacy separately for each annotation type: prodigy data-to-spacy out_folder --ner=my_dataset_name --textcat=my_dataset_name. As explained in the source code, the evaluation split is done at the reader level to ensure that the evaluation data contains examples for each annotation type. But since the random seed is only set once at the beginning, the random split is different in the ner reader than in the textcat reader, resulting in quite low overlap (especially in the evaluation set, which tends to be a small fraction of the total). This alone is quite confusing, and it's definitely not what you want if you have a single dataset with multiple types of annotation. It can be fixed by resetting the random seed before each reader is used (I verified this); is there any drawback to doing that?
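To illustrate the splitting behaviour (a toy sketch, not Prodigy's actual reader code):

import random

examples = list(range(10000))         # stand-in for the annotated examples
eval_size = int(0.2 * len(examples))  # 20% evaluation split

def eval_split(rng):
    shuffled = examples.copy()
    rng.shuffle(shuffled)
    return set(shuffled[:eval_size])

# Seed set only once: the second "reader" sees a different random state,
# so the two evaluation splits overlap only by chance (~20%).
rng = random.Random(0)
print(len(eval_split(rng) & eval_split(rng)) / eval_size)

# Re-seeding before each reader makes the splits identical.
print(len(eval_split(random.Random(0)) & eval_split(random.Random(0))) / eval_size)  # 1.0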
Second issue: in the resulting evaluation dataset, nearly half the examples don't have any textcat labels and nearly half don't have any ner labels; only a small fraction has both. This causes a problem for multiclass textcat (exclusive classes), because a missing label is interpreted as an explicit 0.0 (see the GitHub issue above). As a result, the precision score is horrible, since the model appears to produce lots of false positives. I verified that switching the scorer to the multilabel one gives a much more sensible score, since there a missing label is treated as genuinely missing.
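For reference, this is roughly how I checked the annotation coverage in the exported evaluation data (the dev.spacy path is whatever data-to-spacy wrote to the output folder):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = list(DocBin().from_disk("out_folder/dev.spacy").get_docs(nlp.vocab))

# Examples with no textcat annotation at all (empty cats dict).
no_cats = sum(1 for doc in docs if not doc.cats)
# Examples with no ner annotation at all, assuming unannotated tokens are
# exported as "missing" (empty ent_iob_) rather than as explicit "O".
no_ner = sum(1 for doc in docs if all(token.ent_iob_ == "" for token in doc))

print(f"{no_cats}/{len(docs)} docs without cats, {no_ner}/{len(docs)} without ner")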
I think the second issue needs solving independently, since one should be able to merge two partially overlapping ner and textcat datasets and still get sensible scores when training a model on the result. The first issue perhaps doesn't need solving if the second one is solved, but it's still not how I'd want my resulting data to look in my case.
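In case it helps anyone in the meantime, the scorer switch I tested was along these lines; I'm assuming "spacy.textcat_multilabel_scorer.v1" is the registered scorer name, so double-check that against your spaCy version:

import spacy

nlp = spacy.blank("en")
# Use the multilabel scorer with the exclusive textcat component, so a
# missing gold label is scored as unknown instead of as an explicit 0.0.
nlp.add_pipe(
    "textcat",
    config={"scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v1"}},
)

In the config generated by data-to-spacy, the same thing should be expressible by setting @scorers = "spacy.textcat_multilabel_scorer.v1" in a [components.textcat.scorer] block.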