Calculating accuracy, precision, recall, and F1 score from the evaluation.jsonl file produced by ner.batch-train

Hello,

I am trying to calculate accuracy, precision, recall, and F1 scores for a model created using ner.batch-train on data annotated using ner.teach.

I am using the evaluation.jsonl file it outputs, and I am a bit confused by the 'spans' section of the dictionaries. Take this as an example:

{'text': 'if thats the case thats as true as me saying im richer than microsoft, sony, epic games, apple, samsung combined... wouldnt that be a life to live XD',
 '_input_hash': -1603676578,
 '_task_hash': 958550217,
 'tokens': [{'text': 'if', 'start': 0, 'end': 2, 'id': 0},
  {'text': 'that', 'start': 3, 'end': 7, 'id': 1},
  {'text': 's', 'start': 7, 'end': 8, 'id': 2},
  {'text': 'the', 'start': 9, 'end': 12, 'id': 3},
  {'text': 'case', 'start': 13, 'end': 17, 'id': 4},
  {'text': 'that', 'start': 18, 'end': 22, 'id': 5},
  {'text': 's', 'start': 22, 'end': 23, 'id': 6},
  {'text': 'as', 'start': 24, 'end': 26, 'id': 7},
  {'text': 'true', 'start': 27, 'end': 31, 'id': 8},
  {'text': 'as', 'start': 32, 'end': 34, 'id': 9},
  {'text': 'me', 'start': 35, 'end': 37, 'id': 10},
  {'text': 'saying', 'start': 38, 'end': 44, 'id': 11},
  {'text': 'i', 'start': 45, 'end': 46, 'id': 12},
  {'text': 'm', 'start': 46, 'end': 47, 'id': 13},
  {'text': 'richer', 'start': 48, 'end': 54, 'id': 14},
  {'text': 'than', 'start': 55, 'end': 59, 'id': 15},
  {'text': 'microsoft', 'start': 60, 'end': 69, 'id': 16},
  {'text': ',', 'start': 69, 'end': 70, 'id': 17},
  {'text': 'sony', 'start': 71, 'end': 75, 'id': 18},
  {'text': ',', 'start': 75, 'end': 76, 'id': 19},
  {'text': 'epic', 'start': 77, 'end': 81, 'id': 20},
  {'text': 'games', 'start': 82, 'end': 87, 'id': 21},
  {'text': ',', 'start': 87, 'end': 88, 'id': 22},
  {'text': 'apple', 'start': 89, 'end': 94, 'id': 23},
  {'text': ',', 'start': 94, 'end': 95, 'id': 24},
  {'text': 'samsung', 'start': 96, 'end': 103, 'id': 25},
  {'text': 'combined', 'start': 104, 'end': 112, 'id': 26},
  {'text': '...', 'start': 112, 'end': 115, 'id': 27},
  {'text': 'would', 'start': 116, 'end': 121, 'id': 28},
  {'text': 'nt', 'start': 121, 'end': 123, 'id': 29},
  {'text': 'that', 'start': 124, 'end': 128, 'id': 30},
  {'text': 'be', 'start': 129, 'end': 131, 'id': 31},
  {'text': 'a', 'start': 132, 'end': 133, 'id': 32},
  {'text': 'life', 'start': 134, 'end': 138, 'id': 33},
  {'text': 'to', 'start': 139, 'end': 141, 'id': 34},
  {'text': 'live', 'start': 142, 'end': 146, 'id': 35},
  {'text': 'XD', 'start': 147, 'end': 149, 'id': 36}],
 'spans': [{'text': 'sony', 'start': 71, 'end': 75, 'priority': 0.5, 'score': 0.5,
   'pattern': 422137312, 'label': 'ORG', 'answer': 'accept', 'token_start': 18, 'token_end': 18},
  {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.4444444444, 'score': 0.4444444444,
   'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16},
  {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.5555555556, 'score': 0.5555555556,
   'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16},
  {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.5, 'score': 0.5,
   'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16}],
 'meta': {'score': 0.5, 'pattern': 75},
 '_session_id': 'tracker_ner-default',
 '_view_id': 'ner',
 'answer': 'accept'}

My questions are:

  1. Why does 'microsoft' appear multiple times in spans with the exact same information except different score values? (See the sketch after question 2 for how I am currently collapsing these.)

  2. How do you go about calculating accuracy, precision, etc. from this information? What is the cut-off used for the score variable (e.g., < 0.5, <= 0.5, etc.)? Of course, I am aiming to have my model's accuracy match the score printed by the Prodigy ner.batch-train command.
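
In case it clarifies question 1, this is how I am currently collapsing the duplicated spans before doing anything else. It keeps the highest-scoring copy per (start, end, label); that is purely my own guess at the intended semantics, not something I know ner.batch-train does:

def dedupe_spans(spans):
    # Keep one copy per (start, end, label), preferring the highest score.
    # This is a guess at the semantics, not Prodigy's documented behaviour.
    best = {}
    for span in spans:
        key = (span["start"], span["end"], span["label"])
        if key not in best or span["score"] > best[key]["score"]:
            best[key] = span
    return list(best.values())

# For the example above, the three 'microsoft' spans collapse to the one with
# score 0.5555555556, leaving two spans in total ('microsoft' and 'sony').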

Thanks!

Hello, I am experiencing a similar problem. I also want to calculate accuracy, recall, etc. from the evaluation.jsonl file I am getting from ner.batch-train. The dataset I trained on was created using ner.teach, and it contains multiple spans referencing the same text but sometimes with conflicting model scores (see the example below):

{'text': "RT/SHARE We are SO THRILLED that actions are planned on EVERY continent (except Antarctica, c'mon guys) for the Global Day of Action for the Amazon on Sep 5th!",
 '_input_hash': 953315024,
 '_task_hash': -1148677878,
 'tokens': [{'text': 'RT', 'start': 0, 'end': 2, 'id': 0},
  {'text': '/', 'start': 2, 'end': 3, 'id': 1},
  {'text': 'SHARE', 'start': 3, 'end': 8, 'id': 2},
  {'text': 'We', 'start': 9, 'end': 11, 'id': 3},
  {'text': 'are', 'start': 12, 'end': 15, 'id': 4},
  {'text': 'SO', 'start': 16, 'end': 18, 'id': 5},
  {'text': 'THRILLED', 'start': 19, 'end': 27, 'id': 6},
  {'text': 'that', 'start': 28, 'end': 32, 'id': 7},
  {'text': 'actions', 'start': 33, 'end': 40, 'id': 8},
  {'text': 'are', 'start': 41, 'end': 44, 'id': 9},
  {'text': 'planned', 'start': 45, 'end': 52, 'id': 10},
  {'text': 'on', 'start': 53, 'end': 55, 'id': 11},
  {'text': 'EVERY', 'start': 56, 'end': 61, 'id': 12},
  {'text': 'continent', 'start': 62, 'end': 71, 'id': 13},
  {'text': '(', 'start': 72, 'end': 73, 'id': 14},
  {'text': 'except', 'start': 73, 'end': 79, 'id': 15},
  {'text': 'Antarctica', 'start': 80, 'end': 90, 'id': 16},
  {'text': ',', 'start': 90, 'end': 91, 'id': 17},
  {'text': "c'mon", 'start': 92, 'end': 97, 'id': 18},
  {'text': 'guys', 'start': 98, 'end': 102, 'id': 19},
  {'text': ')', 'start': 102, 'end': 103, 'id': 20},
  {'text': 'for', 'start': 104, 'end': 107, 'id': 21},
  {'text': 'the', 'start': 108, 'end': 111, 'id': 22},
  {'text': 'Global', 'start': 112, 'end': 118, 'id': 23},
  {'text': 'Day', 'start': 119, 'end': 122, 'id': 24},
  {'text': 'of', 'start': 123, 'end': 125, 'id': 25},
  {'text': 'Action', 'start': 126, 'end': 132, 'id': 26},
  {'text': 'for', 'start': 133, 'end': 136, 'id': 27},
  {'text': 'the', 'start': 137, 'end': 140, 'id': 28},
  {'text': 'Amazon', 'start': 141, 'end': 147, 'id': 29},
  {'text': 'on', 'start': 148, 'end': 150, 'id': 30},
  {'text': 'Sep', 'start': 151, 'end': 154, 'id': 31},
  {'text': '5th', 'start': 155, 'end': 158, 'id': 32},
  {'text': '!', 'start': 158, 'end': 159, 'id': 33}],
 'spans': [{'text': 'Amazon',
   'start': 141,
   'end': 147,
   'priority': 0.3181818182,
   'score': 0.3181818182,
   'pattern': -1172721813,
   'label': 'ORG',
   'answer': 'reject',
   'token_start': 29,
   'token_end': 29},
  {'text': 'Amazon',
   'start': 141,
   'end': 147,
   'priority': 0.5205479452,
   'score': 0.5205479452,
   'pattern': -1172721813,
   'label': 'ORG',
   'answer': 'reject',
   'token_start': 29,
   'token_end': 29}],
 'meta': {'score': 0.3181818182, 'pattern': 94},
 '_session_id': 'tracker_ner-default',
 '_view_id': 'ner',
 'answer': 'reject'}
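
To make the conflict concrete, here is a small check I run over the file to surface spans that appear more than once with different (score, answer) pairs. For the example above it prints the 'Amazon' span rejected twice, with scores 0.3181818182 and 0.5205479452. (srsly is just what I happen to use to read the JSONL; any reader works.)

from collections import defaultdict
import srsly

# Flag (start, end, label) keys that carry more than one (score, answer) pair
for eg in srsly.read_jsonl("evaluation.jsonl"):
    by_key = defaultdict(list)
    for span in eg.get("spans", []):
        by_key[(span["start"], span["end"], span["label"])].append(
            (span["score"], span["answer"]))
    for key, hits in by_key.items():
        if len(hits) > 1:
            print(eg["text"][key[0]:key[1]], key, hits)

# e.g. Amazon (141, 147, 'ORG') [(0.3181818182, 'reject'), (0.5205479452, 'reject')]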

I have tried to replicate @honnibal's code for finding precision, recall, etc., and I have also added an accuracy calculation. However, my accuracy does not match the score reported by ner.batch-train, which produced the evaluation.jsonl file that is the source of the dictionary_list variable in the code below (the model and file paths in the snippet are placeholders).

import spacy
import srsly

# Placeholder path: the model directory written out by ner.batch-train
nlp = spacy.load("/path/to/model-output")

# The evaluation examples exported by ner.batch-train
dictionary_list = list(srsly.read_jsonl("evaluation.jsonl"))

tp = 0.0  # predicted spans that match an accepted annotation
fp = 0.0  # predicted spans with no matching accepted annotation
tn = 0.0  # examples with no predictions and no accepted spans
fn = 0.0  # accepted spans the model failed to predict
ignored = 0.0
LABEL = "ORG"

for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        ignored += 1
    else:
        doc = nlp(eg['text'])

        # Model predictions for the target label, keyed by character offsets
        guesses = set((ent.start_char, ent.end_char, ent.label_)
                      for ent in doc.ents if ent.label_ == LABEL)
        # Gold spans: only the annotations that were accepted
        positives = set((span["start"], span["end"], span["label"])
                        for span in eg["spans"] if span['answer'] == 'accept')

        tp += len(guesses.intersection(positives))
        fn += len(positives - guesses)
        fp += len(guesses - positives)
        if len(guesses) == 0 and len(positives) == 0:
            tn += 1  # note: tn is counted per example, the other counts per span

acc = tp / (tp + fn + fp + tn + 1e-10)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
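
(For reference, and purely my own note: the textbook definition would be accuracy = (tp + tn) / (tp + tn + fp + fn), which is not what the snippet above computes. I do not know which definition ner.batch-train uses.)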

I am struggling to understand how Prodigy calculates accuracy given the information in the evaluation.jsonl file, and how it deals with conflicting classification scores for otherwise identical spans within an example. On top of replicating the accuracy score given by ner.batch-train, I would also like to properly calculate precision, recall, and F-score from the evaluation.jsonl file.

I've tried to reverse-engineer how Prodigy calculates accuracy in ner.batch-train, but didn't get very far. I noticed that best_stats has an "unk" key. Does this perhaps count conflicting classifications of the same span? Any help appreciated!
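
As a sanity check on the "unk" theory, this is the quick tally of span-level answers I run over my file (assuming the layout shown above). If "unk" corresponds to spans that were neither accepted nor rejected, I would expect answers other than 'accept' and 'reject' to show up here:

from collections import Counter
import srsly

# Tally the span-level answers across the whole evaluation set
answer_counts = Counter(
    span.get("answer")
    for eg in srsly.read_jsonl("evaluation.jsonl")
    for span in eg.get("spans", [])
)
print(answer_counts)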

Thanks for the comprehensive question @arranjdavis! I am struggling to understand exactly the same things: how Prodigy calculates accuracy, why multiple spans that are identical except for the priority and score values seem to be getting saved in the evaluation file, and how I can calculate precision, recall, and F1 from my evaluation.jsonl given my uncertainty about the above. Any chance of an answer @honnibal or @ines? Thanks so much in advance :)
