Dear Prodigy team,
I have recently used Prodigy's active learning interface for NER and was quite surprised to find that my model returns a higher F1 score on a held-out portion of the training data than on the test data (which was annotated without active learning). There is a single, rare entity type in my data (it appears in ~7% of my sentences), so I started by constructing a large collection of rules (match patterns) to get the active learning started. I then used ner.teach to annotate 3K training sentences (63% of them containing entities) and ner.manual to annotate 2K test sentences (7% containing entities).
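In case the setup matters: the rules were fed to ner.teach as a patterns file, roughly of this shape (the label name, terms, and file name below are just placeholders; the real set is much larger):

```python
import srsly

LABEL = "RARE_ENT"  # placeholder for my single entity type

# A mix of token-based and exact-string patterns, in the format Prodigy expects
patterns = [
    {"label": LABEL, "pattern": [{"lower": "acme"}, {"lower": "reactor"}]},
    {"label": LABEL, "pattern": "Acme Reactor"},
]

# Written as JSONL and passed to ner.teach via --patterns
srsly.write_jsonl("patterns.jsonl", patterns)
```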
I first split the 3K training sentences 80/20 to evaluate performance on a dev set, and then applied the model to the test set. F1 on the dev set was around 86%, but only around 79% on the test set.
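To make the comparison concrete, this is roughly how I scored the same trained pipeline on both sets (assuming spaCy v3 here; the paths and the exported .spacy files are placeholders for my actual data):

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("./trained_model")  # pipeline trained on the ner.teach data

def load_examples(nlp, path):
    # Gold-annotated docs exported from the annotation sets
    gold_docs = DocBin().from_disk(path).get_docs(nlp.vocab)
    return [Example(nlp.make_doc(doc.text), doc) for doc in gold_docs]

for name, path in [("dev (20% of ner.teach data)", "./dev.spacy"),
                   ("test (ner.manual data)", "./test.spacy")]:
    scores = nlp.evaluate(load_examples(nlp, path))
    print(name, "ents_f:", scores["ents_f"])
```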
I was initially expecting better performance on the test set, since the uncertainty sampling in active learning should have filled the training (and therefore the dev) data with harder, more ambiguous examples, so this result surprised me. Am I misunderstanding something? Given that the annotations were produced and checked with the same criteria, do you have any ideas about what could explain this?
My initial thoughts were:
(1) The test set contains only a small number of entities, so it may simply not be representative enough.
(2) I noticed that suggestions with a confidence of 1.0 appeared quite often during the active learning annotation. Is that intended? If not, I guess it could be a major cause.
Thanks in advance!