Recall and Precision (TN, TP, FN, FP)

Tags: ner, spacy
(Anne) #1

Hi,

I have a question.
Is it possible to show true positives, true negatives, false positives, false negatives, recall and the precision of a trained model somewhere within Prodigy?
If not, is there a way you could calculate this yourself?

I trained a model using the following video: https://www.youtube.com/watch?time_continue=1484&v=l4scwf8KeIA

Thanks,
Anne

(Matthew Honnibal) #2

The prodigy.models.ner.EntryRecognizer.evaluate() method will tell you the accuracy of the model, but doesn’t currently return P/R/F scores. The method supports the use-case where the gold-standard has only entities known to be correct, without necessarily containing all of the correct entities — i.e., the use-case where the gold-standard has missing values. You should specify the flag no_missing=True if you don’t have missing values in your gold-standard.

Here’s some code to return P/R/F, assuming you have no missing values in your gold standard:


# Assumes `nlp` is the trained model and `test_examples` is a list of
# gold-standard examples with complete "spans" annotations.
tp = 0.0  # true positives
fp = 0.0  # false positives
fn = 0.0  # false negatives
for eg in test_examples:
    doc = nlp(eg["text"])
    guesses = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    truths = set((span["start"], span["end"], span["label"]) for span in eg["spans"])
    tp += len(guesses.intersection(truths))
    fn += len(truths - guesses)
    fp += len(guesses - truths)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
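To sanity-check the arithmetic, the same set-based computation can be run on a pair of toy entity sets. The entity tuples below are invented purely for illustration:

```python
# Toy illustration of the set-based P/R/F computation above.
# Entities are (start_char, end_char, label) tuples; the values are made up.
guesses = {(0, 4, "PERSON"), (10, 16, "ORG"), (20, 25, "GPE")}
truths = {(0, 4, "PERSON"), (10, 16, "ORG"), (30, 36, "DATE")}

tp = len(guesses & truths)  # 2 predictions match the gold standard
fp = len(guesses - truths)  # 1 spurious prediction
fn = len(truths - guesses)  # 1 missed entity

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
fscore = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(fscore, 3))
```

The tiny epsilon terms in the snippet above only guard against division by zero when a model predicts no entities at all; with non-empty sets, plain division gives the same result.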
#3

Hi @honnibal
I recently started using Prodigy, annotated data and trained a model. Now I am trying to check the accuracy of the model and then calculate precision, recall and F-score for it. I have a few questions.

  • Should I just pass my model to prodigy.models.ner.EntryRecognizer.evaluate(/path/to/my/model)? (Should it be EntityRecognizer? Is that a typo?) I tried this, but got the following error:
    TypeError: evaluate() takes exactly 2 positional arguments (1 given)
  • I am trying to understand what a "gold-standard" model is. I have annotated the data and trained the model in spaCy, but what exactly is a gold standard? I think I am missing something obvious here.
(Ines Montani) #4

@dsnlp Yes, that’s EntityRecognizer, so definitely a typo. This class is Prodigy’s built-in annotation model – so basically, the wrapper that takes care of scoring the examples, updating a model with (incomplete) annotations and so on.

You can find more details on the API, how to initialize the model and what arguments the methods take in your PRODIGY_README.html. The EntityRecognizer is initialized with a loaded nlp object, and you can then call the evaluate method on a list of examples.

#5

@ines Thanks. Unfortunately, I don’t have access to the documentation, as the installation was handled by someone else. Is there a minimal working example you could point me to?

(Ines Montani) #6

@dsnlp Ah, that sucks – you should definitely get that README, since it includes all the detailed API docs. Any way you can contact the person who received the Prodigy installer? Otherwise, if you have the order ID (starting with #EX), email us and we can re-send it.

A minimal example could look something like this:

import spacy
from prodigy.models.ner import EntityRecognizer

nlp = spacy.load("en_core_web_sm")
model = EntityRecognizer(nlp, label=["PERSON", "ORG"])
stats = model.evaluate(examples, no_missing=True)
#7

@ines Sure, I will try to get it. Also, on the other question: what exactly is a gold-standard model? Even in the evaluate function, there is a mention of ‘golds’ (https://github.com/explosion/spaCy/blob/v2.0.5/spacy/language.py#L459):

def evaluate(self, docs_golds, verbose=False):
    scorer = Scorer()
    docs, golds = zip(*docs_golds)
    docs = list(docs)
    golds = list(golds)
    for name, pipe in self.pipeline:
        if not hasattr(pipe, 'pipe'):
            docs = (pipe(doc) for doc in docs)
        else:
            docs = pipe.pipe(docs, batch_size=256)
    for doc, gold in zip(docs, golds):
        if verbose:
            print(doc)
        scorer.score(doc, gold, verbose=verbose)
    return scorer
(Matthew Honnibal) #8

@dsnlp apologies if this isn’t answering the right question, but: In machine learning parlance, the term “gold standard” really just means the reference annotations — the ‘correct answer’ you’re trying to predict. There are unfortunately lots of these little terms-of-art in machine learning and NLP. I think one of the best practical discussions of evaluation in ML is in this short primer by Andrew Ng: https://www.mlyearning.org/
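To make that concrete: in Prodigy’s JSONL format, a gold-standard example is simply a text with its reference spans attached. A minimal sketch (the text and character offsets below are invented for illustration):

```python
# A made-up gold-standard example in Prodigy's span format:
# the "spans" list holds the reference ("correct") annotations.
gold_example = {
    "text": "Apple hired Tim Cook in 1998.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
    ],
}

# Check that each character offset really points at the annotated string.
for span in gold_example["spans"]:
    print(gold_example["text"][span["start"]:span["end"]], span["label"])
```

Evaluation then amounts to comparing a model’s predicted spans against these reference spans, as in the P/R/F snippet earlier in this thread.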

(Anne) #9

Thank you for your answer, I managed to calculate it now.