I annotated 130 examples using Prodigy for training and 20 others for testing. I used this scorer function:
```python
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

test_results = evaluate(ner_model, TEST_DATA)
```
The F, P, and R scores from this function are all the same value: 89.28. I am not sure why it would return the same score for all three.
Precision and recall will be the same if the number of predictions is the same as the number of true annotations: both divide the same true-positive count, precision by the number of predicted entities and recall by the number of gold entities. And if precision and recall are the same, then the F-score must be the same value as both of them as well, since the F-score is the harmonic mean of the two.
You probably want to have a look at your predictions and compare them to the gold standard, to see what’s up. It might be that your model only makes mistakes on the entity type, but not the span boundaries, for instance. Or it might be a less interesting coincidence — after all, the evaluation set is quite small.
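To separate those two cases, you can diff the predicted entity tuples against the gold ones. This is a minimal sketch on hypothetical `(start, end, label)` spans — in your setup you would build `predicted` from `doc.ents` and `gold` from the annotations in `TEST_DATA`:

```python
# Compare predicted entity spans against gold spans to separate
# exact matches from label-only errors (same boundaries, wrong type).
def diff_entities(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    exact = predicted & gold
    # Spans that match on (start, end) but carry a different label.
    pred_spans = {(s, e): l for s, e, l in predicted - exact}
    gold_spans = {(s, e): l for s, e, l in gold - exact}
    label_errors = [(span, pred_spans[span], gold_spans[span])
                    for span in pred_spans.keys() & gold_spans.keys()]
    return exact, label_errors

# Hypothetical example data
predicted = [(0, 5, 'ORG'), (10, 16, 'PERSON')]
gold      = [(0, 5, 'ORG'), (10, 16, 'GPE')]
exact, label_errors = diff_entities(predicted, gold)
print(exact)         # {(0, 5, 'ORG')}
print(label_errors)  # [((10, 16), 'PERSON', 'GPE')]
```

If most of your errors land in `label_errors`, the model is getting the boundaries right and only confusing entity types.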