Evaluating Precision and Recall of NER


(W.P. McNeill) #1

I want to evaluate the precision and recall of an NER model on an annotated dataset. The recipe would look like this:

prodigy ner.pr -F ner_pr.py evaluation-dataset model --label MY_LABEL --threshold 0.5
Precision 0.8333        Recall 0.9091   F-score 0.8696

The model predicts named entities in the text in the evaluation dataset. Each entity predicted with a score above the threshold is compared to the true entities in the dataset to generate precision and recall statistics.
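For concreteness, the numbers in that example output correspond to 10 true positives, 2 false positives and 1 false negative (the values my placeholder evaluate returns below):

precision = 10 / (10 + 2)                                 # 0.8333
recall = 10 / (10 + 1)                                    # 0.9091
f_score = 2 * precision * recall / (precision + recall)   # 0.8696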

I think I need to use a prodigy.models.ner.EntityRecognizer object. I’ve been looking through the documentation and sample code, but haven’t figured out how to do this. Here’s what I have written so far.

import spacy
from prodigy.components.db import connect
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer
from prodigy.util import log

DB = connect()


@recipe("ner.pr",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        label=recipe_args["entity_label"],
        threshold=("detection threshold", "option", "t", float))
def precision_recall(dataset, spacy_model, label=None, threshold=0.5):
    """
    Calculate precision and recall of NER predictions.
    """

    # I don't know what to do here.
    def evaluate(model, samples, label, threshold):
        return 10, 2, 1

    log("RECIPE: Starting recipe ner.pr", locals())
    model = EntityRecognizer(spacy.load(spacy_model), label=label)
    log('RECIPE: Initialised EntityRecognizer with model {}'.format(spacy_model), model.nlp.meta)
    samples = DB.get_dataset(dataset)
    tp, fp, fn = evaluate(model, samples, label, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(precision, recall, f_score))

How do I write evaluate(model, samples, label, threshold) so that it actually calculates the true positives, false positives, and false negatives?

I can do most of this with the spaCy model, but I don’t know how to get scores so I can’t incorporate the threshold.

def evaluate(model, samples, label, threshold):
    tp = fp = fn = 0
    for sample in samples:
        truth = set((span["start"], span["end"]) for span in sample["spans"] if span["label"] == label)
        hypotheses = set((entity.start_char, entity.end_char)
                         for entity in model.nlp(sample["text"]).ents if entity.label_ == label)
        tp += len(truth.intersection(hypotheses))
        fp += len(hypotheses - truth)
        fn += len(truth - hypotheses)
    return tp, fp, fn

I thought this was what Doc.cats was for, but here’s what I get from that attribute on a document containing a GPE:

>>> nlp = spacy.load("en")
>>> nlp("Hello America").cats
{}

This is spaCy version 2.0.5.


(Matthew Honnibal) #2

Most people use the functions in scikit-learn for these things. Personally I don’t, because none of our libraries depend on sklearn. So I usually implement P/R/F close to where I’m using it, as it’s a pretty simple metric.

There’s a Scorer() class in spaCy that you might find useful if you don’t want to use scikit-learn: https://github.com/explosion/spaCy/blob/master/spacy/scorer.py . You can also use the nlp.evaluate() method: https://github.com/explosion/spaCy/blob/v2.0.5/spacy/language.py#L459
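Roughly, the Scorer pattern looks like this. This is a minimal, untested sketch; it assumes your gold annotations are (text, [(start, end, label), ...]) pairs:

from spacy.scorer import Scorer
from spacy.gold import GoldParse

def score_ner(nlp, examples):
    # examples: list of (text, [(start, end, label), ...]) pairs
    scorer = Scorer()
    for text, entity_offsets in examples:
        gold = GoldParse(nlp.make_doc(text), entities=entity_offsets)
        scorer.score(nlp(text), gold)
    return scorer.scores  # includes 'ents_p', 'ents_r' and 'ents_f'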


(W.P. McNeill) #3

I’m asking something a little different. Like you, I find it easiest to write my own P/R code. (And it appears that Scorer and nlp.evaluate are utilities that calculate P/R from the spaCy data structures.) But additionally I want to calculate P/R at a given threshold, so the model I’m evaluating is only considered to hypothesize an entity if its confidence score for that entity is above a given threshold t. The goal is to run this for a range of thresholds and plot an F-score curve, ROC-style.

The part I can’t figure out is how to get the model to return scores for the entities it hypothesizes. I thought that was what the cats attribute was for, but that doesn’t behave the way I’d expect:

>>> nlp = spacy.load("en_core_web_lg")
>>> doc = nlp("This is America.")
>>> [entity.label_ for entity in doc.ents]
['GPE']
>>> doc.cats
{}

The documentation and recipe code make it look like the EntityRecognizer is what I want. You initialize it with a model and then it returns entities and scores, but I’m not sure what to pass as input to EntityRecognizer.

That last evaluate function I wrote above does everything I want, except that it uses all the entities the model hypothesizes to calculate the score. I can’t figure out how to select just the subset of entities that have a score > t.
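Here’s roughly the kind of thing I’m imagining, as an untested sketch. It assumes that calling the EntityRecognizer on a stream of examples yields (score, example) tuples with one suggested span per example (the way ner.teach appears to consume it), and that prodigy.util.set_hashes adds whatever hashes the model expects on its input:

from collections import defaultdict
from prodigy.util import set_hashes  # assumption: adds the required _input_hash

def evaluate(model, samples, label, threshold):
    # Sketch only: assumes model(stream) yields (score, example) tuples,
    # one suggested span per example, as in ner.teach.
    stream = [set_hashes({"text": sample["text"]}) for sample in samples]
    predicted = defaultdict(set)
    for score, eg in model(stream):
        if score < threshold:
            continue
        for span in eg.get("spans", []):
            if span["label"] == label:
                predicted[eg["text"]].add((span["start"], span["end"]))
    tp = fp = fn = 0
    for sample in samples:
        truth = set((span["start"], span["end"])
                    for span in sample.get("spans", []) if span["label"] == label)
        hypotheses = predicted[sample["text"]]
        tp += len(truth & hypotheses)
        fp += len(hypotheses - truth)
        fn += len(truth - hypotheses)
    return tp, fp, fn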


(Spencer) #4

In case others like me come looking for a basic scoring recipe, here is what I cooked up.

It doesn’t consider a threshold, but it evaluates model accuracy without re-training, and it can output either P/R/F or the standard Prodigy score scheme.

import spacy
import spacy.gold
import spacy.scorer
from prodigy.components.db import connect
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer, merge_spans
from prodigy.util import log, prints
from prodigy.components.preprocess import split_sentences, add_tokens


def gold_to_spacy(dataset, spacy_model, biluo=False):
    #### Ripped from ner.gold_to_spacy. Only change is returning annotations instead of printing or saving
    DB = connect()
    examples = DB.get_dataset(dataset)
    examples = [eg for eg in examples if eg['answer'] == 'accept']
    if biluo:
        if not spacy_model:
            prints("Exporting annotations in BILUO format requires a spaCy "
                   "model for tokenization.", exits=1, error=True)
        nlp = spacy.load(spacy_model)
    annotations = []
    for eg in examples:
        entities = [(span['start'], span['end'], span['label'])
                    for span in eg.get('spans', [])]
        if biluo:
            doc = nlp(eg['text'])
            entities = spacy.gold.biluo_tags_from_offsets(doc, entities)
            annot_entry = [eg['text'], entities]
        else:
            annot_entry = [eg['text'], {'entities': entities}]
        annotations.append(annot_entry)

    return annotations

def evaluate_prf(ner_model, examples):
    #### Source: https://stackoverflow.com/questions/44827930/evaluation-in-a-spacy-ner-model
    scorer = spacy.scorer.Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = spacy.gold.GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

@recipe("ner.stats",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        label=recipe_args["entity_label"],
        isPrf=("Output Precsion, Recall, F-Score", "flag", "prf"))

def model_stats(dataset, spacy_model, label=None, isPrf=False):
    """
    Evaluate model accuracy of model based on dataset with no training
    inspired from https://support.prodi.gy/t/evaluating-precision-and-recall-of-ner/193/2
    got basic model evaluation by looking at the batch-train recipe
    """
   
    log("RECIPE: Starting recipe ner.stats", locals())
    DB = connect()
    nlp = spacy.load(spacy_model)
    

    if(isPrf):
        examples = gold_to_spacy(dataset, spacy_model)
        score = evaluate_prf(nlp, examples)
        print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(score['ents_p'], score['ents_r'], score['ents_f']))

    else:
        #ripped this from ner.batch-train recipe
        model = EntityRecognizer(nlp, label=label)
        evaldoc = merge_spans(DB.get_dataset(dataset))
        evals = list(split_sentences(model.orig_nlp, evaldoc))

        scores = model.evaluate(evals)

        print("Accuracy {:0.4f}\tRight {:0.0f}\tWrong {:0.0f}\tUnknown {:0.0f}\tEntities {:0.0f}".format(scores['acc'], scores['right'],scores['wrong'],scores['unk'],scores['ents']))

(Michael Higgins) #5

This only uses an annotation if the label is present. For instance, if your validation set includes an annotated sentence that does not contain the entity you are detecting, the accuracy score does not use that annotation.


(Spencer) #6

You can always write your own scoring. The above recipe shows how to marshal results into a comparable format. It wouldn’t be crazy to write your own comparison logic from there.

You can see the spaCy scorer at https://github.com/explosion/spaCy/blob/master/spacy/scorer.py if you need a starting point.
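For example, something along these lines would work. It’s an untested sketch that reuses the [text, {'entities': [...]}] format gold_to_spacy returns above, and it also counts false positives on sentences that have no gold entities, which covers the point about no-entity annotations:

def custom_prf(nlp, annotations, label):
    # annotations: list of [text, {'entities': [(start, end, label), ...]}]
    tp = fp = fn = 0
    for text, annot in annotations:
        gold = set((start, end) for start, end, ent_label in annot["entities"]
                   if ent_label == label)
        pred = set((ent.start_char, ent.end_char)
                   for ent in nlp(text).ents if ent.label_ == label)
        tp += len(gold & pred)
        fp += len(pred - gold)  # predictions on no-entity sentences land here
        fn += len(gold - pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score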