I want to evaluate the precision and recall of an NER model on an annotated dataset. The recipe would be run like this:

prodigy ner.pr -F ner_pr.py evaluation-dataset model --label MY_LABEL --threshold 0.5

and it should print something like:

Precision 0.8333 Recall 0.9091 F-score 0.8696
The model predicts named entities in the texts of the evaluation dataset. Each entity predicted with a score above the threshold is compared to the gold-standard entities in the dataset to produce precision and recall statistics.
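For context, the annotated examples in the evaluation dataset are in Prodigy's usual NER format, i.e. a text plus character-offset spans with labels, something like this:

{
    "text": "Hello America",
    "spans": [{"start": 6, "end": 13, "label": "GPE"}],
    "answer": "accept"
}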
I think I need to use a prodigy.models.ner.EntityRecognizer
object. I’ve been looking through the documentation and sample code, but haven’t figured out how to do this. Here’s what I have written so far.
import spacy
from prodigy.components.db import connect
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer
from prodigy.util import log
DB = connect()
@recipe("ner.pr",
dataset=recipe_args["dataset"],
spacy_model=recipe_args["spacy_model"],
label=recipe_args["entity_label"],
threshold=("detection threshold", "option", "t", float))
def precision_recall(dataset, spacy_model, label=None, threshold=0.5):
"""
Calculate precision and recall of NER predictions.
"""
# I don't know what to do here.
def evaluate(model, samples, label, threshold):
return 10, 2, 1
log("RECIPE: Starting recipe ner.pr", locals())
model = EntityRecognizer(spacy.load(spacy_model), label=label)
log('RECIPE: Initialised EntityRecognizer with model {}'.format(spacy_model), model.nlp.meta)
samples = DB.get_dataset(dataset)
tp, fp, fn = evaluate(model, samples, threshold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * (precision * recall) / (precision + recall)
print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(precision, recall, f_score))
How do I write evaluate(model, samples, label, threshold) so that it actually calculates the true positives, false positives, and false negatives? I can do most of this with the spaCy model alone, but I don't know how to get prediction scores, so I can't incorporate the threshold:
def evaluate(model, samples, label, threshold):
    tp = fp = fn = 0
    for sample in samples:
        # Gold spans annotated in the dataset for this label
        truth = {(span["start"], span["end"])
                 for span in sample["spans"] if span["label"] == label}
        # Spans predicted by the spaCy pipeline for this label
        hypotheses = {(entity.start_char, entity.end_char)
                      for entity in model.nlp(sample["text"]).ents
                      if entity.label_ == label}
        tp += len(truth.intersection(hypotheses))
        fp += len(hypotheses - truth)
        fn += len(truth - hypotheses)
    return tp, fp, fn
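My best guess, from reading how ner.teach seems to use the EntityRecognizer, is that the model can be called on a stream of examples and yields (score, example) tuples, with one candidate span per example. If that's actually how it works, I imagine a thresholded version would look roughly like this (untested, and the assumptions about model(samples) and the yielded examples are mine):

from collections import defaultdict

def evaluate(model, samples, label, threshold):
    # Gold spans for each text (assumes texts are unique in the dataset)
    truth = {
        sample["text"]: {(span["start"], span["end"])
                         for span in sample["spans"] if span["label"] == label}
        for sample in samples
    }
    # Guess: calling the model on the stream yields (score, example) tuples,
    # each carrying a single candidate span in example["spans"]
    hypotheses = defaultdict(set)
    for score, example in model(samples):
        if score >= threshold:
            for span in example["spans"]:
                if span["label"] == label:
                    hypotheses[example["text"]].add((span["start"], span["end"]))
    tp = fp = fn = 0
    for text, gold in truth.items():
        predicted = hypotheses[text]
        tp += len(gold & predicted)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    return tp, fp, fn

But I haven't been able to confirm from the docs that calling the model on a stream really yields per-span scores like that, so I may be completely off.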
I thought this was what Doc.cats was for, but here's what I get from that attribute on a document containing a GPE entity:
>>> nlp = spacy.load("en")
>>> nlp("Hello America").cats
{}
This is spaCy version 2.0.5.