Hello, I am experiencing a similar problem. I also want to calculate accuracy, recall, etc. from the evaluate.jsonl file I am getting from ner.batch-train. The dataset I trained on was created using ner.teach, and it contains multiple spans referencing the same text, sometimes with conflicting model scores (see the example below):
{'text': "RT/SHARE We are SO THRILLED that actions are planned on EVERY continent (except Antarctica, c'mon guys) for the Global Day of Action for the Amazon on Sep 5th!",
'_input_hash': 953315024,
'_task_hash': -1148677878,
'tokens': [{'text': 'RT', 'start': 0, 'end': 2, 'id': 0},
{'text': '/', 'start': 2, 'end': 3, 'id': 1},
{'text': 'SHARE', 'start': 3, 'end': 8, 'id': 2},
{'text': 'We', 'start': 9, 'end': 11, 'id': 3},
{'text': 'are', 'start': 12, 'end': 15, 'id': 4},
{'text': 'SO', 'start': 16, 'end': 18, 'id': 5},
{'text': 'THRILLED', 'start': 19, 'end': 27, 'id': 6},
{'text': 'that', 'start': 28, 'end': 32, 'id': 7},
{'text': 'actions', 'start': 33, 'end': 40, 'id': 8},
{'text': 'are', 'start': 41, 'end': 44, 'id': 9},
{'text': 'planned', 'start': 45, 'end': 52, 'id': 10},
{'text': 'on', 'start': 53, 'end': 55, 'id': 11},
{'text': 'EVERY', 'start': 56, 'end': 61, 'id': 12},
{'text': 'continent', 'start': 62, 'end': 71, 'id': 13},
{'text': '(', 'start': 72, 'end': 73, 'id': 14},
{'text': 'except', 'start': 73, 'end': 79, 'id': 15},
{'text': 'Antarctica', 'start': 80, 'end': 90, 'id': 16},
{'text': ',', 'start': 90, 'end': 91, 'id': 17},
{'text': "c'mon", 'start': 92, 'end': 97, 'id': 18},
{'text': 'guys', 'start': 98, 'end': 102, 'id': 19},
{'text': ')', 'start': 102, 'end': 103, 'id': 20},
{'text': 'for', 'start': 104, 'end': 107, 'id': 21},
{'text': 'the', 'start': 108, 'end': 111, 'id': 22},
{'text': 'Global', 'start': 112, 'end': 118, 'id': 23},
{'text': 'Day', 'start': 119, 'end': 122, 'id': 24},
{'text': 'of', 'start': 123, 'end': 125, 'id': 25},
{'text': 'Action', 'start': 126, 'end': 132, 'id': 26},
{'text': 'for', 'start': 133, 'end': 136, 'id': 27},
{'text': 'the', 'start': 137, 'end': 140, 'id': 28},
{'text': 'Amazon', 'start': 141, 'end': 147, 'id': 29},
{'text': 'on', 'start': 148, 'end': 150, 'id': 30},
{'text': 'Sep', 'start': 151, 'end': 154, 'id': 31},
{'text': '5th', 'start': 155, 'end': 158, 'id': 32},
{'text': '!', 'start': 158, 'end': 159, 'id': 33}],
'spans': [{'text': 'Amazon',
'start': 141,
'end': 147,
'priority': 0.3181818182,
'score': 0.3181818182,
'pattern': -1172721813,
'label': 'ORG',
'answer': 'reject',
'token_start': 29,
'token_end': 29},
{'text': 'Amazon',
'start': 141,
'end': 147,
'priority': 0.5205479452,
'score': 0.5205479452,
'pattern': -1172721813,
'label': 'ORG',
'answer': 'reject',
'token_start': 29,
'token_end': 29}],
'meta': {'score': 0.3181818182, 'pattern': 94},
'_session_id': 'tracker_ner-default',
'_view_id': 'ner',
'answer': 'reject'}
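Note how the same 'Amazon' span appears twice with different scores but the same answer. One thing I've considered is collapsing such duplicates before scoring, e.g. keeping only the highest-scoring entry per (start, end, label), though I have no idea whether that is what ner.batch-train does internally. A rough sketch, using only the fields shown above:

def dedupe_spans(spans):
    # Keep one span per (start, end, label); when duplicates disagree on the
    # model score, keep the highest-scoring one (an arbitrary choice on my part).
    best = {}
    for span in spans:
        key = (span['start'], span['end'], span['label'])
        if key not in best or span.get('score', 0.0) > best[key].get('score', 0.0):
            best[key] = span
    return list(best.values())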
I have tried to replicate @honnibal's code for finding precision, recall, etc., and I have also added an accuracy calculation. However, my accuracy is not the same as the accuracy score given by ner.batch-train, which produced the evaluate.jsonl file that is the source for the dictionary_list variable in the code below.
import spacy
import srsly  # used here to read the JSONL file; any JSONL reader works

# Assumptions: "path/to/model" is the model exported by ner.batch-train,
# and evaluate.jsonl is the evaluation file it produced.
nlp = spacy.load("path/to/model")
dictionary_list = list(srsly.read_jsonl("evaluate.jsonl"))

tp = 0.0  # exact-match spans the model predicts and I accepted
fp = 0.0  # predicted spans that are not among the accepted spans
tn = 0.0  # examples with no accepted spans and no predictions
fn = 0.0  # accepted spans the model misses
ignored = 0.0
LABEL = "ORG"

for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        ignored += 1
    else:
        doc = nlp(eg['text'])
        # Model predictions for the target label as (start_char, end_char, label) tuples
        guesses = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents if ent.label_ == LABEL)
        # Gold spans: only the spans I explicitly accepted during annotation
        positives = set((span["start"], span["end"], span["label"]) for span in eg["spans"] if span['answer'] == 'accept')
        tp += len(guesses.intersection(positives))
        fn += len(positives - guesses)
        fp += len(guesses - positives)
        if len(guesses) == 0 and len(positives) == 0:
            tn += 1

# My attempt at an accuracy figure; it does not match what ner.batch-train reports.
acc = tp / (tp + fn + fp + tn + 1e-10)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
I am struggling to understand how Prodigy calculates accuracy from the information in the evaluate.jsonl file, and how it deals with conflicting classification scores for otherwise identical spans within an example. On top of replicating the accuracy score given by ner.batch-train, I would also like to properly calculate precision, recall, and F-score from the evaluate.jsonl file.
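For comparison, here is a sketch of another scoring I've been considering: treat each annotated span individually and count the model as right whenever its prediction agrees with my accept/reject answer. This is only a guess at the kind of per-span bookkeeping batch-train might be doing (it reuses nlp and dictionary_list from the code above):

right = 0
wrong = 0
for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        continue
    doc = nlp(eg['text'])
    predicted = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    for span in eg['spans']:
        # Does the model's prediction for this span agree with my answer?
        # Note: duplicated spans like the 'Amazon' example above get counted twice here.
        model_says_entity = (span['start'], span['end'], span['label']) in predicted
        annotator_accepted = span['answer'] == 'accept'
        if model_says_entity == annotator_accepted:
            right += 1
        else:
            wrong += 1

span_accuracy = right / (right + wrong + 1e-10)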
Beyond that, I've tried to reverse engineer how Prodigy calculates accuracy in ner.batch-train, but didn't get very far. I noticed that best_stats has an "unk" key. Is this perhaps for cases where there are conflicting classifications of the same span? Any help appreciated!