Evaluating Precision and Recall of NER

I want to evaluate the precision and recall of an NER model on an annotated dataset. The recipe would look like this:

prodigy ner.pr -F ner_pr.py evaluation-dataset model --label MY_LABEL --threshold 0.5
Precision 0.8333        Recall 0.9091   F-score 0.8696

The model predicts named entities in the text in the evaluation dataset. Each entity predicted with a score above the threshold is compared to the true entities in the dataset to generate precision and recall statistics.

I think I need to use a prodigy.models.ner.EntityRecognizer object. I’ve been looking through the documentation and sample code, but haven’t figured out how to do this. Here’s what I have written so far.

import spacy
from prodigy.components.db import connect
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer
from prodigy.util import log

DB = connect()


@recipe("ner.pr",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        label=recipe_args["entity_label"],
        threshold=("detection threshold", "option", "t", float))
def precision_recall(dataset, spacy_model, label=None, threshold=0.5):
    """
    Calculate precision and recall of NER predictions.
    """

    # I don't know what to do here.
    def evaluate(model, samples, label, threshold):
        return 10, 2, 1

    log("RECIPE: Starting recipe ner.pr", locals())
    model = EntityRecognizer(spacy.load(spacy_model), label=label)
    log('RECIPE: Initialised EntityRecognizer with model {}'.format(spacy_model), model.nlp.meta)
    samples = DB.get_dataset(dataset)
    tp, fp, fn = evaluate(model, samples, label, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(precision, recall, f_score))

How do I write evaluate(model, samples, label, threshold) so that it actually calculates the true positives, false positives, and false negatives?

I can do most of this with the spaCy model, but I don’t know how to get scores so I can’t incorporate the threshold.

def evaluate(model, samples, label, threshold):
    tp = fp = fn = 0
    for sample in samples:
        truth = set((span["start"], span["end"]) for span in sample["spans"] if span["label"] == label)
        hypotheses = set((entity.start_char, entity.end_char)
                         for entity in model.nlp(sample["text"]).ents if entity.label_ == label)
        tp += len(truth.intersection(hypotheses))
        fp += len(hypotheses - truth)
        fn += len(truth - hypotheses)
    return tp, fp, fn

I thought this was what Doc.cats was for, but here’s what I get from that attribute on a document containing a GPE:

>>> nlp = spacy.load("en")
>>> nlp("Hello America").cats
{}

This is spaCy version 2.0.5.

Most people use the functions in scikit-learn for these things. Personally I don’t, because none of our libraries depend on sklearn. So I usually implement P/R/F close to where I’m using it, as it’s a pretty simple metric.
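For reference, a minimal sketch of what such a hand-rolled P/R/F helper could look like. The name prf is just illustrative, and returning 0.0 for undefined values (no predictions, or no gold entities) is an assumption, not something prescribed by spaCy or Prodigy.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f_score) from raw counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# The placeholder counts from the recipe in the question: 10 TP, 2 FP, 1 FN.
precision, recall, f_score = prf(tp=10, fp=2, fn=1)
print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(precision, recall, f_score))
```

The guards matter mostly at high thresholds, where a model may predict nothing at all and tp + fp is zero.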

There’s a Scorer() class in spaCy that you might find useful if you don’t want to use scikit-learn: https://github.com/explosion/spaCy/blob/master/spacy/scorer.py . You can also use the nlp.evaluate() method: https://github.com/explosion/spaCy/blob/v2.0.5/spacy/language.py#L459

I’m asking something a little different. Like you, I find it easiest to write my own P/R code. (And it appears that Scorer and nlp.evaluate are utilities that calculate P/R from the spaCy data structures.) But additionally I want to calculate P/R at a given threshold, so the model I’m evaluating is only considered to hypothesize an entity if its confidence score for that entity is above a given threshold t. The goal is to run this for a range of thresholds and plot how precision, recall and F-score change with the threshold, ROC-style.
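To make the goal concrete, here is a rough sketch of the threshold sweep in plain Python. It assumes the scores are already available in a hypothetical format, one dict per sample mapping (start, end) character offsets to a confidence score; how to obtain those scores from the model is exactly the open question, so nothing here calls spaCy or Prodigy.

```python
def counts_at_threshold(gold, scored, threshold):
    """Count TP/FP/FN for one sample.

    gold:   set of (start, end) offsets for the true entities
    scored: dict {(start, end): score} of predicted entities (hypothetical format)
    """
    hypotheses = {span for span, score in scored.items() if score > threshold}
    return len(gold & hypotheses), len(hypotheses - gold), len(gold - hypotheses)


def sweep(samples, thresholds):
    """Return one (threshold, precision, recall, f_score) row per threshold."""
    rows = []
    for t in thresholds:
        tp = fp = fn = 0
        for gold, scored in samples:
            s_tp, s_fp, s_fn = counts_at_threshold(gold, scored, t)
            tp, fp, fn = tp + s_tp, fp + s_fp, fn + s_fn
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        rows.append((t, p, r, f))
    return rows


# Made-up scores, purely for illustration.
samples = [
    ({(6, 13)}, {(6, 13): 0.9, (0, 5): 0.3}),
    ({(8, 15)}, {(8, 15): 0.6}),
]
for t, p, r, f in sweep(samples, [0.2, 0.5, 0.8]):
    print("t={:0.1f}\tP {:0.4f}\tR {:0.4f}\tF {:0.4f}".format(t, p, r, f))
```

Counts are accumulated over all samples before dividing (micro-averaging), which matches the evaluate function above.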

The part I can’t figure out is how to get the model to return scores for the entities it hypothesizes. I thought that was what the cats attribute was for, but that doesn’t behave the way I’d expect.

>>> nlp = spacy.load("en_core_web_lg")
>>> doc = nlp("This is America.")
>>> [entity.label_ for entity in doc.ents]
['GPE']
>>> doc.cats
{}

The documentation and recipe code make it look like the EntityRecognizer is what I want. You initialize it with a model and then it returns entities and scores, but I’m not sure what to pass as input to the EntityRecognizer.

That last evaluate function I wrote above does everything I want, except that it uses all the entities a model hypothesizes to calculate the score. I can’t figure out how to select just the subset of entities that have a score > t.

In case others like me come looking for a basic scoring recipe, here is what I cooked up.

It doesn’t consider the threshold, but it evaluates model accuracy without retraining and can output either P/R/F or the standard Prodigy scores.

import spacy
import spacy.scorer
from prodigy.components.db import connect
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer, merge_spans
from prodigy.util import log, prints  # prints is used for the BILUO error message below
from prodigy.components.preprocess import split_sentences, add_tokens


def gold_to_spacy(dataset, spacy_model, biluo=False):
    #### Ripped from ner.gold_to_spacy. Only change is returning annotations instead of printing or saving
    DB = connect()
    examples = DB.get_dataset(dataset)
    examples = [eg for eg in examples if eg['answer'] == 'accept']
    if biluo:
        if not spacy_model:
            prints("Exporting annotations in BILUO format requires a spaCy "
                   "model for tokenization.", exits=1, error=True)
        nlp = spacy.load(spacy_model)
    annotations = []
    for eg in examples:
        entities = [(span['start'], span['end'], span['label'])
                    for span in eg.get('spans', [])]
        if biluo:
            doc = nlp(eg['text'])
            entities = spacy.gold.biluo_tags_from_offsets(doc, entities)
            annot_entry = [eg['text'], entities]
        else:
            annot_entry = [eg['text'], {'entities': entities}]
        annotations.append(annot_entry)

    return annotations

def evaluate_prf(ner_model, examples):
    #### Source: https://stackoverflow.com/questions/44827930/evaluation-in-a-spacy-ner-model
    scorer = spacy.scorer.Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = spacy.gold.GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

@recipe("ner.stats",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        label=recipe_args["entity_label"],
        isPrf=("Output Precision, Recall, F-Score", "flag", "prf"))
def model_stats(dataset, spacy_model, label=None, isPrf=False):
    """
    Evaluate the accuracy of a model on a dataset, without any training
    inspired from https://support.prodi.gy/t/evaluating-precision-and-recall-of-ner/193/2
    got basic model evaluation by looking at the batch-train recipe
    """
   
    log("RECIPE: Starting recipe ner.stats", locals())
    DB = connect()
    nlp = spacy.load(spacy_model)
    

    if isPrf:
        examples = gold_to_spacy(dataset, spacy_model)
        score = evaluate_prf(nlp, examples)
        print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(score['ents_p'], score['ents_r'], score['ents_f']))

    else:
        #ripped this from ner.batch-train recipe
        model = EntityRecognizer(nlp, label=label)
        evaldoc = merge_spans(DB.get_dataset(dataset))
        evals = list(split_sentences(model.orig_nlp, evaldoc))

        scores = model.evaluate(evals)

        print("Accuracy {:0.4f}\tRight {:0.0f}\tWrong {:0.0f}\tUnknown {:0.0f}\tEntities {:0.0f}".format(
            scores['acc'], scores['right'], scores['wrong'], scores['unk'], scores['ents']))

This only uses an annotation if the label is present. For instance, if your validation set includes an annotated sentence that does not contain the entity you are detecting, the accuracy score does not use that annotation.

You can always write your own scoring. The above recipe shows how to marshal results into a comparable format. It wouldn’t be crazy to write your own comparison logic from there.
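As a hedged sketch of what that comparison logic might look like: the function below takes annotations in the [text, {'entities': [(start, end, label), ...]}] format that gold_to_spacy returns, plus a parallel list of predicted spans. The name compare and the predictions format are assumptions made for this example; with spaCy the predictions could be built from (ent.start_char, ent.end_char, ent.label_) for each ent in nlp(text).ents.

```python
def compare(annotations, predictions, label=None):
    """Count TP/FP/FN by exact (start, end, label) match.

    annotations: list of [text, {'entities': [(start, end, label), ...]}]
    predictions: list of [(start, end, label), ...], one list per annotation
    label:       if given, only spans with this label are counted
    """
    tp = fp = fn = 0
    for (text, annot), predicted in zip(annotations, predictions):
        truth = {tuple(e) for e in annot["entities"] if label is None or e[2] == label}
        guess = {tuple(e) for e in predicted if label is None or e[2] == label}
        tp += len(truth & guess)
        fp += len(guess - truth)
        fn += len(truth - guess)
    return tp, fp, fn


annotations = [
    ["This is America.", {"entities": [(8, 15, "GPE")]}],
    ["Nothing to see here.", {"entities": []}],  # annotated, but contains no entity
]
predictions = [
    [(8, 15, "GPE")],
    [(0, 7, "GPE")],  # spurious prediction on the entity-free text
]
print(compare(annotations, predictions, label="GPE"))
```

Because texts with an empty entity list are still iterated, a spurious prediction on them counts as a false positive, which addresses the limitation mentioned above.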

You can see the spaCy scorer at https://github.com/explosion/spaCy/blob/master/spacy/scorer.py if you need a starting point.