Poor evaluation score when exporting merged ner and textcat data

I have a dataset where I've been annotating both textcat (exclusive) and ner labels simultaneously. I recently started exporting all the data at once to use with spaCy (via data-to-spacy), and the scores I'm getting for the textcat model are very strange: high R values for the textcat labels, but very low P, almost as if the model were being evaluated as a multilabel model and applying all the labels in most cases. This isn't caused by training a NER and textcat model in the same pipeline; it's caused by the textcat data itself. Even with a textcat-only pipeline config, I see a stark difference depending on whether I exported textcat-only training data or both ner and textcat. With textcat-only data, I get high P and R.

I'm investigating the issue myself at the moment, but I have yet to fully understand what's going on. A first possible clue is that the number of exported examples differs depending on whether I export textcat only or ner+textcat.

============================== Generating data ==============================
Components: ner, textcat
Merging training and evaluation data for 2 components
  - [ner] Training: 10928 | Evaluation: 2731 (20% split)
  - [textcat] Training: 10928 | Evaluation: 2731 (20% split)
Training: 12932 | Evaluation: 4854


============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 10928 | Evaluation: 2731 (20% split)
Training: 10820 | Evaluation: 2727

So there's clearly not 100% overlap between the examples that have ner annotations and those that have textcat annotations, and possibly some duplication as well. The evaluation split in particular looks strangely non-overlapping. In fact, the ratio 2731/4854 = 0.56 is pretty close to the low P numbers I'm getting, so it's quite likely related. I'll keep investigating, but wanted to post this here now in case anyone has an idea.
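A quick back-of-the-envelope check on those numbers (assuming, hypothetically, that the two components draw their 20% evaluation splits independently):

```python
# Counts from the data-to-spacy output above
ner_eval = 2731
textcat_eval = 2731
merged_eval = 4854

# An example counted only once in the merge must appear in both splits,
# so the observed overlap between the two evaluation splits is:
observed_overlap = ner_eval + textcat_eval - merged_eval
print(observed_overlap)  # 608

# If the two 20% splits were drawn independently, the expected overlap
# would only be about 20% of either split:
expected_if_independent = round(0.2 * ner_eval)
print(expected_if_independent)  # 546
```

The observed overlap (608) is in the right ballpark for two near-independent splits, which would explain why the merged evaluation set is so much larger than either individual one.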

Hi! It sounds like the problem might be related to this issue in spaCy and how "missing" labels are evaluated by the textcat component. The fix will be included in v3.3 of spaCy.

Thanks @ines , that does seem to be the explanation! If so, my understanding is that this only affects the evaluation, not the training? I.e., the model update handles missing labels correctly? A multilabel textcat model should presumably take an example without any labels to mean that none of the labels apply, while a multiclass textcat model should assume that such an example carries no information and not use it for the update.
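To spell out what I mean, here's a rough sketch in plain Python (a hypothetical gradient function, not spaCy's actual implementation) of how I'd expect the two cases to handle an empty annotation:

```python
def textcat_gradient(scores, gold_cats, exclusive):
    """Sketch of the expected update behaviour, NOT spaCy's real code.

    scores: dict of label -> model probability
    gold_cats: dict of label -> 0.0/1.0; empty if the example is unannotated
    exclusive: True for multiclass textcat, False for multilabel
    """
    if exclusive and not gold_cats:
        # Multiclass: an empty annotation carries no information,
        # so the example should contribute no gradient at all.
        return {label: 0.0 for label in scores}
    # Multilabel: a label that is absent from gold_cats is treated as a
    # true negative, i.e. the target for that label is 0.0.
    return {label: scores[label] - gold_cats.get(label, 0.0) for label in scores}
```

So for an unannotated example, the multiclass gradient would be all zeros, while the multilabel gradient would push every score towards 0.0.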

Yes, that should be the case I think! If you want to, you could also just port over the changes from this PR and hack them into your spaCy installation, since it's mostly just one file anyway. And then re-run the training and compare the results :slightly_smiling_face:

Hi @ines , after testing the scoring in v3.3.0 and finding out that it didn't fix my issue, I decided to investigate a bit more thoroughly. There are actually two things going on here.

Firstly, when you have a single dataset containing two types of annotation (ner and textcat in my case, annotated together), you have to pass the dataset to data-to-spacy separately for each annotation type: prodigy data-to-spacy out_folder --ner=my_dataset_name --textcat=my_dataset_name. As explained in the source code, the evaluation split is done at the reader level to ensure that the evaluation set contains examples of each annotation type. But since the random seed is only set once at the beginning, the random split in the ner reader differs from the one in the textcat reader, resulting in quite low overlap (especially in the evaluation set, which tends to be a small fraction of the total). This alone is quite confusing, and it's definitely not what you want if you have a single dataset with multiple types of annotation. It can be fixed by resetting the random seed before each reader is used (I verified this). Is there any drawback to doing that?
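To illustrate the seed issue with a toy example (plain Python with a hypothetical split function, not Prodigy's actual reader code):

```python
import random

examples = list(range(100))

def split_eval(items, fraction=0.2):
    # Mimics a reader doing its own shuffle-and-split; note that it
    # consumes the shared global random state.
    shuffled = items[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return set(shuffled[:cut])

# Seed set once at the beginning, as in the current behaviour:
random.seed(0)
ner_eval = split_eval(examples)
textcat_eval = split_eval(examples)  # state has moved on -> different split
print(len(ner_eval & textcat_eval))  # small overlap, splits disagree

# Resetting the seed before each reader makes the splits identical:
random.seed(0)
ner_eval2 = split_eval(examples)
random.seed(0)
textcat_eval2 = split_eval(examples)
print(ner_eval2 == textcat_eval2)  # True
```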

Second issue: in the resulting evaluation dataset, nearly half the examples have no textcat labels, nearly half have no ner labels, and only a small fraction have both. This causes a problem for multiclass textcat (exclusive classes), where a missing label is interpreted as 0.0 (see the GitHub issue above): the precision score becomes terrible because the model appears to produce lots of false positives. I verified that switching to the multilabel scorer gives a much more sensible score, since there a missing label is interpreted as truly missing.
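A toy illustration of why treating missing labels as 0.0 wrecks precision (plain Python, not spaCy's actual Scorer):

```python
def precision(preds, golds, treat_missing_as_negative):
    # preds/golds: one label per example; None in golds means the example
    # has no textcat annotation at all.
    tp = fp = 0
    for pred, gold in zip(preds, golds):
        if gold is None and not treat_missing_as_negative:
            continue  # multilabel-style scoring: skip unannotated examples
        if pred == gold:
            tp += 1
        else:
            fp += 1  # counts every prediction on an unannotated example
    return tp / (tp + fp)

# Half the merged eval set has no textcat annotation:
golds = ["A"] * 5 + [None] * 5
preds = ["A"] * 10  # a model that is perfect on the annotated examples

print(precision(preds, golds, treat_missing_as_negative=True))   # 0.5
print(precision(preds, golds, treat_missing_as_negative=False))  # 1.0
```

With roughly half the merged evaluation set unannotated for textcat, this mechanism alone would cap precision around the ~0.56 I was seeing.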

I think the second issue needs solving independently, since one should be able to merge two potentially overlapping ner and textcat datasets and get sensible scores when training a model on the result. The first issue perhaps doesn't need solving if the second one is, but it's still not how I'd want my resulting datasets to look in this case.