NER evaluation

Dear Prodigy team,

I have recently used Prodigy's active learning interface for NER and was quite surprised to find that my model returns a higher F1 score on the training data than on the test data (no active learning involved). There is a single, rare entity type in my data (it appears in ~7% of my sentences), so I started by constructing a large collection of rules to bootstrap the active learning. I then used ner.teach to annotate 3K training sentences (63% of which contain entities) and ner.manual to annotate 2K test sentences (7% of which contain entities).
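To make the setup concrete, the rules were a patterns file in Prodigy's match-pattern format, built roughly along the lines of the sketch below and passed to ner.teach via --patterns (the label name and terms here are placeholders, not my real rules):

```python
import json

# Sketch of the patterns file used to bootstrap ner.teach.
# RARE_ENTITY and the example terms are placeholders for illustration.
patterns = [
    {"label": "RARE_ENTITY", "pattern": [{"lower": "foo"}]},
    {"label": "RARE_ENTITY", "pattern": [{"lower": "foo"}, {"lower": "bar"}]},  # two-token phrase
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```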

I first split the 3K training sentences 80/20 to evaluate performance on a dev set, and then applied the model to the test set. F1 turned out to be around 86% on the dev set but only 79% on the test set.
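For reference, the F1 figures I'm quoting are entity-level scores; a simplified sketch of that metric, assuming exact (start, end, label) matches (this is not my actual evaluation code):

```python
def entity_f1(gold_docs, pred_docs):
    """Entity-level F1 over exact (start, end, label) span matches."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold_docs, pred_docs):
        gold = {(s["start"], s["end"], s["label"]) for s in gold_spans}
        pred = {(s["start"], s["end"], s["label"]) for s in pred_spans}
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```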

I was initially expecting better performance on the test set, since the uncertainty sampling done during active learning should put the harder examples into the training data, so I was quite surprised by this result. Am I misunderstanding something? Considering that the annotation was done and checked with the same criteria, would you have any ideas on this?

My initial thoughts were: (1) the small number of entities in the test set might mean it isn't representative enough? (2) I noticed that some entities with a confidence of 1.0 appeared quite often during annotation when using active learning; is that meant to happen? If not, I guess that could be a major cause.

Thanks in advance!

Hi Ferran,

You're right that in general one would expect active learning to result in harder instances in the training data than in the test data, but there can always be situations where we're surprised. For instance, the active learning can end up "stuck" in a state where it's missing some of the most difficult entities and simply assigning them a near-zero score. In that situation those entities would not be in the actively learnt sample, but they would be in the manually annotated data.
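To make that failure mode concrete, here's a toy sketch of uncertainty sampling (the scores and threshold are invented for illustration): candidates the model confidently rejects never look "uncertain", so they're never queued for annotation.

```python
# Toy illustration of uncertainty sampling in active learning.
# A hard entity the model confidently (and wrongly) rejects scores near 0.0,
# so it looks "certain" and is never selected, while it still shows up in
# the manually annotated test data.
candidates = [
    {"text": "ambiguous case", "score": 0.48},       # unsure -> selected for annotation
    {"text": "easy entity", "score": 0.97},          # confident yes -> skipped
    {"text": "hard, missed entity", "score": 0.02},  # confident no -> skipped
]

def uncertainty(score):
    # distance from a 50/50 prediction; smaller means more uncertain
    return abs(score - 0.5)

selected = [c["text"] for c in candidates if uncertainty(c["score"]) < 0.2]
print(selected)  # ['ambiguous case']
```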

Active learning works better when precision is more difficult than recall, for instance if you have some ambiguous words that might be entities but sometimes aren't. It's less good when you have a wide spread of cases that are difficult to identify as candidates based on their surface form, as the model may never learn to assign these cases partial probability.

If the active learning isn't working well, I would suggest using the ner.correct recipe to still get some assistance from the model, while maintaining the flexibility to make sure you don't miss any of the cases.
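As a rough sketch of the idea behind that suggestion (not the actual recipe code), ner.correct-style assistance boils down to letting the model pre-highlight its predictions so you only accept or fix them, while every sentence still passes in front of you:

```python
import spacy

# Stand-in for your own trained pipeline; the model name here is just an example.
nlp = spacy.load("en_core_web_sm")

def preannotate(texts):
    # Yield Prodigy-style tasks with the model's entity predictions
    # pre-filled as spans, ready to be accepted or corrected by hand.
    for doc in nlp.pipe(texts):
        yield {
            "text": doc.text,
            "spans": [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ],
        }
```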


Hi Matthew,

Thanks a lot for your detailed answer. The situation you describe makes a lot of sense, especially since some of the initial rules I generated to pre-train the model for active learning might not be similar enough to some of our entity mention types, so those could be the ones assigned a near-zero score.

I guess there is still a benefit to having used active learning, since the number and diversity of entities in the training data is now much higher than it would have been if I had annotated sentences sampled from the raw text, where entities only appear in 7% of sentences.

The ner.correct approach sounds like a very good way to move forward, and it also allows for a comparison between different sources of training data: active learning, raw annotation, or a combination of the two.

Thank you very much for your answer and congratulations on the amazing work you are all doing at Explosion AI!