Unexpected NER scores / models when training using gold and binary datasets combined in v1.11.1


Since I have both a gold dataset and 2 binary datasets, I wanted to try out the new feature in 1.11 of training on both gold and binary datasets at the same time.

The datasets look as follows:

  • ~700 sentences with gold-standard annotations (no missing entities) of entities A and B (almost all of them contain NEs)
  • ~1100 binary annotations of only entity A
  • ~600 binary annotations of only entity B, on the same sentences as the binary set above (100% sentence overlap)

Since I could not find in the documentation if I should specify -NM in this case, I ran the following 4 experiments:

  1. training on gold dataset only
  2. training on gold dataset only with -NM specified (I know this does not make sense but wanted to compare with the below)
  3. training on gold + binary datasets
  4. training on gold + binary datasets with -NM specified

I used the following command:
prodigy train -n datasets -m en_core_sci_lg -L -V (-NM)

Both the resulting F scores and the actual accuracy of the resulting models looked strange to me.
Regarding F scores, the highest scores were as follows:
1: 0.87
2: 0.86
3: 0.80
4: 0.90

When -NM is not specified, F scores decreased quite a lot (0.87 -> 0.80) so maybe it would indeed be correct to specify -NM?

When specifying -NM, F scores increased a bit (0.86 -> 0.90) but not that much. I would have expected a larger increase given the amount of binary annotations relative to the gold annotations (but this might be normal).

The biggest issue I faced is the actual accuracy of the resulting models.
When I use ner.correct with the highest-scoring model from 4, the model suggests MANY strange tokens on all sentences: 'and', punctuation, or just words that are not NEs at all. I would say only ~30% of predicted entities are correct. The model labels very aggressively; sometimes it highlights ~80% of the words of a sentence as NEs. The results are similar for 2, but worse.

Models 1 and 3 seem MUCH better, in line with their high F scores. Actually, 3 seems better than 1, which would be weird given the F scores.

The questions that follow from this are:

  • In this case, should I indeed specify -NM?
  • Could it be that the scoring calculation is incorrect when training on gold + binary datasets, with or without -NM specified? (Maybe because I did not specify a separate gold-standard evaluation set?)
  • What could explain the strange "aggressive labeling" behavior of the models trained with -NM (assuming this flag is correct)?

For reference, I also performed the above experiments without specifying the en_core_sci_lg model, and repeated the model-less training with v1.10.a11 as well. The results are similar, especially the weird behaviour of the models trained with -NM.

Would be great to get your thoughts on this, thanks a lot!

Kind regards,

Hi! In the case you describe, setting -NM / --ner-missing shouldn't be necessary: Prodigy will detect your binary annotations from the binary datasets provided and represent them accordingly, and you'll want your gold-standard dataset to be interpreted with all unannotated tokens as "not an entity" to take full advantage of that information. If you treat all unannotated tokens as missing, you're basically excluding a bunch of relevant information you already have (the fact that unannotated tokens aren't entities, which is important to learn from).
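To make the difference concrete, here is a minimal sketch of the two interpretations of the same annotation. It uses a simplified one-label-per-token view, not the representation Prodigy or spaCy use internally; the helper name `token_labels`, the example sentence and its spans are all made up for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def token_labels(
    n_tokens: int, spans: List[Span], unannotated_missing: bool
) -> List[Optional[str]]:
    """Hypothetical helper: expand span annotations to one label per token.

    Unannotated tokens become "O" (known non-entity) by default,
    or None (unknown) when they are treated as missing.
    """
    fill = None if unannotated_missing else "O"
    labels: List[Optional[str]] = [fill] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels


# Invented example sentence with two annotated entities of type "A".
tokens = ["Aspirin", "and", "ibuprofen", "reduce", "pain", "."]
spans = [(0, 1, "A"), (2, 3, "A")]

# Gold-standard interpretation: everything unannotated is "not an entity".
gold_view = token_labels(len(tokens), spans, unannotated_missing=False)

# "Missing" interpretation: everything unannotated is unknown.
missing_view = token_labels(len(tokens), spans, unannotated_missing=True)

print(gold_view)
print(missing_view)
```

In the second view, four of the six tokens carry no information at all, including the explicit negative evidence for "and" and the punctuation.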

Yes, I wonder if the evaluation here is part of the problem: If you don't provide a dedicated evaluation set, a percentage of your examples will be held back for evaluation. This includes a mix of all examples, including your gold-standard data and binary annotations. In general, that's fine, but depending on the number of examples you have and the binary vs. manual distribution, you may end up with a lot of tokens that your evaluation data doesn't know the answer for (which is further amplified if you treat unannotated tokens as missing).

When spaCy calculates the accuracy score, it will skip tokens with missing information, because these are neither right nor wrong (they could be either – you just don't know):
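As a rough illustration of that skipping behaviour, here is a simplified token-level scorer. It is a sketch only and not spaCy's actual scoring code, which works on entity spans; the function name `prf` and the example labels are assumptions made up for this illustration. `None` stands for a token with missing information.

```python
from typing import List, Optional, Tuple


def prf(gold: List[Optional[str]], pred: List[str]) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 that ignores unknown gold tokens."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g is None:
            continue  # unknown: neither right nor wrong, so not counted
        if p != "O" and p == g:
            tp += 1
        else:
            if p != "O":
                fp += 1  # predicted an entity the gold data rejects
            if g != "O":
                fn += 1  # missed (or mislabelled) a gold entity
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Invented example: an over-eager model labels four of six tokens as "A".
pred = ["A", "A", "A", "A", "O", "O"]

# Same sentence, evaluated with full gold labels vs. with unknowns.
gold_full = ["A", "O", "A", "O", "O", "O"]
gold_sparse = ["A", None, "A", None, None, None]

full_scores = prf(gold_full, pred)
sparse_scores = prf(gold_sparse, pred)
print("full:  ", full_scores)
print("sparse:", sparse_scores)
```

With the full gold labels, the two spurious predictions halve the precision. With the sparse labels, the same predictions score perfectly, because the false positives fall on tokens that are skipped.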

This makes sense, but it also means that if you end up with a lot of unknowns in your evaluation data, the evaluation may also be less meaningful. So the higher score you see here may not actually indicate a "better" model – just a more sparse evaluation, because you're only seeing the results for tokens you know the answer for.

This is definitely strange and unideal :thinking: One possible explanation kinda ties in with the points above: if your annotations cause the model to overfit on your data or it ends up in a weird state somehow, and all unannotated tokens are unknowns, there's kinda nothing that will keep it from predicting "and" as an entity (which would normally be the case if that token was explicitly annotated as O). And the evaluation wouldn't catch it either, because all of those tokens are unknowns.

We'll definitely investigate this further and see if there's something we can do in Prodigy to work around this. Even if it turns out that there's a logical explanation for the behaviour, it's obviously still bad because it leads to these very subtle and unintuitive results. I almost wonder if --ner-missing is obsolete now with the new training workflow and spaCy v3 and we should just get rid of it altogether :thinking:

Thanks Ines!
I will create a separate gold dataset for evaluation and see what happens.
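A minimal sketch of how such a held-out gold split could be made, assuming the gold examples are available as a list of dicts. The helper `split_gold`, the 20% fraction and the placeholder examples are assumptions for illustration; loading from and saving to Prodigy datasets is not shown. To my understanding, the v1.11 train command lets a dataset be marked as evaluation data with an `eval:` prefix, but that is worth checking against the docs.

```python
import random
from typing import Dict, List, Tuple

Example = Dict[str, object]


def split_gold(
    examples: List[Example], eval_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Hypothetical helper: shuffle reproducibly and hold out an eval split."""
    shuffled = list(examples)  # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder data standing in for the ~700 gold-standard sentences.
gold = [{"text": f"sentence {i}"} for i in range(700)]
train_examples, eval_examples = split_gold(gold)
print(len(train_examples), len(eval_examples))
```

Keeping the evaluation split gold-only means every token in it has a known answer, so spurious predictions show up as false positives.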

In that case, the performance of the model that includes the binary sets should always improve (at least slightly), right? (As opposed to the apparent decrease now, which might be due to the evaluation method.)

I just reran the experiments with a small evaluation set, and they now perform as expected: the models with -NM perform very poorly, with F scores of ~0.25 (high recall but very low precision, as observed), and the gold+binary model performs quite a bit better than the gold-only model.

Maybe in the future a warning could be printed when training without a dedicated eval set, saying that the F scores are not reliable.

Hi Tom,

Thanks for rerunning the experiments, it's great to hear that the results are now as expected :slight_smile:

As you've found, and as Ines explained, the -NM flag really shouldn't be set manually anymore: Prodigy now knows behind the scenes which datasets contain potentially missing information and which don't. We'll update the docs and the API to clarify that this setting is deprecated from v1.11 onwards. Thanks again for your detailed report & findings!