Recall and Precision (TN, TP, FN, FP)

Tags: ner, spacy
(Anne) #1

Hi,

I have a question.
Is it possible to show true positives, true negatives, false positives, false negatives, recall and the precision of a trained model somewhere within Prodigy?
If not, is there a way you could calculate this yourself?

I trained a model using the following video: https://www.youtube.com/watch?time_continue=1484&v=l4scwf8KeIA

Thanks,
Anne

(Matthew Honnibal) #2

The prodigy.models.ner.EntryRecognizer.evaluate() method will tell you the accuracy of the model, but doesn’t currently return P/R/F scores. The method supports the use-case where the gold-standard has only entities known to be correct, without necessarily containing all of the correct entities — i.e., the use-case where the gold-standard has missing values. You should specify the flag no_missing=True if you don’t have missing values in your gold-standard.

Here’s some code to return P/R/F, assuming you have no missing values in your gold standard:


# Assumes `nlp` is the trained model and `test_examples` is a list of
# gold-standard examples with complete "spans" annotations.
tp = 0.0  # true positives
fp = 0.0  # false positives
fn = 0.0  # false negatives
for eg in test_examples:
    doc = nlp(eg["text"])
    guesses = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    truths = set((span["start"], span["end"], span["label"]) for span in eg["spans"])
    tp += len(guesses.intersection(truths))
    fn += len(truths - guesses)
    fp += len(guesses - truths)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
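To sanity-check the arithmetic, the same set-based computation can be run on a pair of toy entity sets. The entity tuples below are invented purely for illustration:

```python
# Toy illustration of the set-based P/R/F computation above.
# Entities are (start_char, end_char, label) tuples; the values are made up.
guesses = {(0, 4, "PERSON"), (10, 16, "ORG"), (20, 25, "GPE")}
truths = {(0, 4, "PERSON"), (10, 16, "ORG"), (30, 36, "DATE")}

tp = len(guesses & truths)  # 2 predictions match the gold standard
fp = len(guesses - truths)  # 1 spurious prediction
fn = len(truths - guesses)  # 1 missed entity

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
fscore = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(fscore, 3))
```

The tiny epsilon terms in the snippet above only guard against division by zero when a model predicts no entities at all; with non-empty sets, plain division gives the same result.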
#3

Hi @honnibal
I recently started using Prodigy, annotated data and trained a model. Now I am trying to check the accuracy of the model and then calculate precision, recall and F-score for it. I have a few questions.

  • Should I just pass my model to prodigy.models.ner.EntryRecognizer.evaluate(/path/to/my/model)? (Should it be EntityRecognizer? Is that a typo?) I tried this, but got the following error:
    TypeError: evaluate() takes exactly 2 positional arguments (1 given)
  • I am trying to understand what a "gold-standard" model is. I have annotated the data and trained the model in spaCy, but what exactly is a gold standard? I think I am missing something obvious here.
(Ines Montani) #4

@dsnlp Yes, that’s EntityRecognizer, so definitely a typo. This class is Prodigy’s built-in annotation model – so basically, the wrapper that takes care of scoring the examples, updating a model with (incomplete) annotations and so on.

You can find more details on the API, how to initialize the model and what arguments the methods take in your PRODIGY_README.html. The EntityRecognizer is initialized with a loaded nlp object, and you can then call the evaluate method on a list of examples.

#5

@ines Thanks. Unfortunately, I don’t have access to the documentation, as the installation was handled by someone else. Is there a minimal working example you could point me to?

(Ines Montani) #6

@dsnlp Ah, that sucks – you should definitely get that README, since it includes all the detailed API docs. Any way you can contact the person who received the Prodigy installer? Otherwise, if you have the order ID (starting with #EX), email us and we can re-send it.

A minimal example could look something like this:

import spacy
from prodigy.models.ner import EntityRecognizer

nlp = spacy.load("en_core_web_sm")
model = EntityRecognizer(nlp, label=["PERSON", "ORG"])
stats = model.evaluate(examples, no_missing=True)
#7

@ines Sure, I will try to get it. Also, on the other question: what exactly is a gold-standard model? Even in the evaluate function, there is a mention of ‘golds’ (https://github.com/explosion/spaCy/blob/v2.0.5/spacy/language.py#L459):

def evaluate(self, docs_golds, verbose=False):
    scorer = Scorer()
    docs, golds = zip(*docs_golds)
    docs = list(docs)
    golds = list(golds)
    for name, pipe in self.pipeline:
        if not hasattr(pipe, 'pipe'):
            docs = (pipe(doc) for doc in docs)
        else:
            docs = pipe.pipe(docs, batch_size=256)
    for doc, gold in zip(docs, golds):
        if verbose:
            print(doc)
        scorer.score(doc, gold, verbose=verbose)
    return scorer
(Matthew Honnibal) #8

@dsnlp apologies if this isn’t answering the right question, but: In machine learning parlance, the term “gold standard” really just means the reference annotations — the ‘correct answer’ you’re trying to predict. There are unfortunately lots of these little terms-of-art in machine learning and NLP. I think one of the best practical discussions of evaluation in ML is in this short primer by Andrew Ng: https://www.mlyearning.org/
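To make that concrete: in Prodigy’s JSONL format, a gold-standard example is simply a text with its reference spans attached. A minimal sketch (the text and character offsets below are invented for illustration):

```python
# A made-up gold-standard example in Prodigy's span format:
# the "spans" list holds the reference ("correct") annotations.
gold_example = {
    "text": "Apple hired Tim Cook in 1998.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
    ],
}

# Check that each character offset really points at the annotated string.
for span in gold_example["spans"]:
    print(gold_example["text"][span["start"]:span["end"]], span["label"])
```

Evaluation then amounts to comparing a model’s predicted spans against these reference spans, as in the P/R/F snippet earlier in this thread.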

(Anne) #9

Thank you for your answer, I managed to calculate it now.