Show false negatives/false positives in NER

Hey :slight_smile:

I am trying to find out which of the entities annotated for NER are skipped by the model (false negatives) and which pieces of text the model incorrectly picks up as entities (false positives). Is there an easy way to do this via the Prodigy/spaCy API?

I hacked my way through the code a bit but couldn't find anything. The closest I could get from the train recipe was the scores object, but that only contains the aggregate scores. It would be really nice to store the predictions as well. Then we could compute other metrics and plots (confusion matrix, etc.).


Hi! There's no built-in function for this at the moment, but it should be pretty straightforward to implement. You probably want to do this as a separate step, though, after you've trained the model – and you probably also want to use a separate dedicated evaluation set instead of just doing a random split so you can compare the results more reliably (if you're not doing that already).

To get the false positives/negatives, you can then process your evaluation data with your trained model and compare the "spans" against the predicted doc.ents:

import spacy

# Pair each text with its annotated example so both come back from nlp.pipe
data_tuples = ((eg["text"], eg) for eg in your_evaluation_data)
nlp = spacy.load("./your_trained_model")
for doc, eg in nlp.pipe(data_tuples, as_tuples=True):
    # Gold and predicted entities as (start_char, end_char, label) tuples
    correct_ents = [(e["start"], e["end"], e["label"]) for e in eg["spans"]]
    predicted_ents = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
    for ent in predicted_ents:
        if ent not in correct_ents:
            print("False positive:", ent)
    for ent in correct_ents:
        if ent not in predicted_ents:
            print("False negative:", ent)
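
If you also want something closer to a confusion matrix, one option is to keep the entity tuples and count label pairs for spans whose offsets match exactly. This is only a minimal sketch in plain Python: the function name `label_confusions` is hypothetical (it is not part of spaCy or Prodigy), and it assumes entities are already available as (start, end, label) tuples like the ones built above.

```python
from collections import Counter


def label_confusions(correct_ents, predicted_ents):
    """Count (gold_label, predicted_label) pairs for spans with identical offsets.

    Both arguments are lists of (start_char, end_char, label) tuples.
    Spans that exist on only one side are ignored here; those are the
    plain false positives / false negatives.
    """
    gold_by_span = {(start, end): label for start, end, label in correct_ents}
    confusions = Counter()
    for start, end, label in predicted_ents:
        gold_label = gold_by_span.get((start, end))
        if gold_label is not None:
            confusions[(gold_label, label)] += 1
    return confusions


# Made-up example data, for illustration only
correct = [(0, 5, "ORG"), (10, 14, "PERSON")]
predicted = [(0, 5, "ORG"), (10, 14, "ORG"), (20, 25, "GPE")]
print(label_confusions(correct, predicted))
```

Off-diagonal pairs such as ("PERSON", "ORG") are spans the model found with the right boundaries but the wrong label.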

Hi, the above does not seem to work in spaCy 3.0.3. I tried

def confusion_matrix(your_evaluation_data=None, ner_model = None, nameForNewLabel='PRODUCTS'):
    #
    tp,fp,fn,tn = 0,0,0,0
    #
    data_tuples = [(eg.text, eg) for eg in your_evaluation_data]
    # see https://spacy.io/api/language#pipe
    for doc, example in ner_model.pipe(data_tuples, as_tuples=True):
        # correct_ents
        ents_x2y = example.get_aligned_spans_x2y(example.reference.ents)
        correct_ents = [(e.start_char, e.end_char, e.label_) for e in ents_x2y]
        # predicted_ents
        ents_x2y = example.get_aligned_spans_x2y(doc.ents)
        predicted_ents = [(e.start_char, e.end_char, e.label_) for e in ents_x2y]
        #
        for ent in predicted_ents:
            if ent not in correct_ents:
                print("False positive:", ent)
        for ent in correct_ents:
            if ent not in predicted_ents:
                print("False negative:", ent)
        # false positives
        fp += len([ent for ent in predicted_ents if ent not in correct_ents])
        # true positives
        tp += len([ent for ent in correct_ents if ent in predicted_ents])
        # false negatives
        fn += len([ent for ent in correct_ents if ent not in predicted_ents])
        # true negatives
        tn += len([ent for ent in predicted_ents if ent in correct_ents])
    
    return tp,fp,fn,tn 

but after a lot of effort, the resulting values still do not match what I see in

    scores_testing = ner_model.evaluate(test_data)
    print("scores_testing")
    print(scores_testing)
    precision_test = scores_testing['ents_per_type'][nameForNewLabel]['p']
    recall_test = scores_testing['ents_per_type'][nameForNewLabel]['r']
    f1_test = scores_testing['ents_per_type'][nameForNewLabel]['f']

Any clues why? Many thanks, Eurico

Hi! The custom function you wrote takes a parameter nameForNewLabel, but it looks like that isn't actually being used, so it'll return results aggregated over all labels. That also means you'll have to compare it with scores_testing['ents_p'] (etc.) instead of the label-specific values.

If it still doesn't match - can you paste some actual numbers to check exactly what the difference is?
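
For the label-specific comparison, the counts would need to be restricted to one label first. A minimal sketch of what that filter could look like, assuming the entities are (start, end, label) tuples as in the function above; the helper name `count_for_label` is made up for illustration and is not a spaCy API:

```python
def count_for_label(correct_ents, predicted_ents, label):
    """Return (tp, fp, fn) for a single label.

    Both entity lists contain (start_char, end_char, label) tuples.
    Only exact matches on offsets and label count as true positives.
    """
    gold = {ent for ent in correct_ents if ent[2] == label}
    pred = {ent for ent in predicted_ents if ent[2] == label}
    tp = len(gold & pred)  # predicted and annotated
    fp = len(pred - gold)  # predicted but not annotated
    fn = len(gold - pred)  # annotated but not predicted
    return tp, fp, fn


# Made-up example data, for illustration only
correct = [(0, 5, "PRODUCTS"), (10, 14, "ORG"), (20, 27, "PRODUCTS")]
predicted = [(0, 5, "PRODUCTS"), (10, 14, "PRODUCTS"), (30, 33, "ORG")]
print(count_for_label(correct, predicted, "PRODUCTS"))  # (1, 1, 1)
```

Counts filtered this way are the ones to hold against scores_testing['ents_per_type'][label], while unfiltered counts belong with the overall ents_p / ents_r / ents_f.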

Hi, many thanks for your quick answer. It was my mistake: I now get an exact match with spaCy's internal numbers for precision, recall and F1. I changed my code to

def confusion_matrix(your_evaluation_data=None, ner_model = None):
    #
    tp,fp,fn,tn = 0,0,0,0
    #
    data_tuples = [(eg.text, eg) for eg in your_evaluation_data]
    # see https://spacy.io/api/language#pipe
    for doc, example in ner_model.pipe(data_tuples, as_tuples=True):
        # correct_ents
        ents_x2y = example.get_aligned_spans_x2y(example.reference.ents)
        correct_ents = [(e.start_char, e.end_char, e.label_) for e in ents_x2y]
        # predicted_ents
        ents_x2y = example.get_aligned_spans_x2y(doc.ents)
        predicted_ents = [(e.start_char, e.end_char, e.label_) for e in ents_x2y]
        #
        for ent in predicted_ents:
            if ent not in correct_ents:
                print("False positive:", ent)
        for ent in correct_ents:
            if ent not in predicted_ents:
                print("False negative:", ent)
        # false positives
        fp += len([ent for ent in predicted_ents if ent not in correct_ents])
        # true positives
        tp += len([ent for ent in predicted_ents if ent in correct_ents])
        # false negatives
        fn += len([ent for ent in correct_ents if ent not in predicted_ents])
        # note: not real true negatives; this counts gold entities that were
        # also predicted, i.e. the same matches as tp
        tn += len([ent for ent in correct_ents if ent in predicted_ents])
    
    return tp,fp,fn,tn 

However, it would still be nice to get tp, fp, fn, tn directly from spaCy; maybe one day you could add that feature. Precision and recall are good to have, but the raw tp, fp, fn, tn counts are even better for detailed debugging. Thanks, and well done on spaCy, it's an amazing package!
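
For anyone who wants to do the same comparison, here is a rough sketch of how precision, recall and F1 can be recomputed from the raw counts. The function name `prf_from_counts` is just an illustration, and the comparison assumes spaCy reports ents_p / ents_r / ents_f as fractions between 0 and 1; if your version reports percentages, rescale accordingly.

```python
def prf_from_counts(tp, fp, fn):
    """Compute (precision, recall, f1) from raw entity counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts, for illustration only
precision, recall, f1 = prf_from_counts(tp=8, fp=2, fn=4)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```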

Happy to hear you got it working, and thanks for posting the code as a reference. It's always useful for others finding this topic later :slight_smile:

We'd have to think about how to get that functionality into spaCy without causing too much additional overhead during training, because often you really only want the numbers, and the additional information would just take up memory. But yes, you're right that it would be a convenient feature for diving into your model's predictions.