Calculating accuracy, precision, recall, and F1 score from the evaluation.jsonl file produced by ner.batch-train

Hello, I am experiencing a similar problem. I also want to calculate accuracy, recall, etc. from the evaluation.jsonl file I am getting from ner.batch-train. The dataset I trained on was created using ner.teach, and it contains multiple spans referencing the same text, sometimes with conflicting model scores (see the example below):

{'text': "RT/SHARE We are SO THRILLED that actions are planned on EVERY continent (except Antarctica, c'mon guys) for the Global Day of Action for the Amazon on Sep 5th!",
 '_input_hash': 953315024,
 '_task_hash': -1148677878,
 'tokens': [{'text': 'RT', 'start': 0, 'end': 2, 'id': 0},
  {'text': '/', 'start': 2, 'end': 3, 'id': 1},
  {'text': 'SHARE', 'start': 3, 'end': 8, 'id': 2},
  {'text': 'We', 'start': 9, 'end': 11, 'id': 3},
  {'text': 'are', 'start': 12, 'end': 15, 'id': 4},
  {'text': 'SO', 'start': 16, 'end': 18, 'id': 5},
  {'text': 'THRILLED', 'start': 19, 'end': 27, 'id': 6},
  {'text': 'that', 'start': 28, 'end': 32, 'id': 7},
  {'text': 'actions', 'start': 33, 'end': 40, 'id': 8},
  {'text': 'are', 'start': 41, 'end': 44, 'id': 9},
  {'text': 'planned', 'start': 45, 'end': 52, 'id': 10},
  {'text': 'on', 'start': 53, 'end': 55, 'id': 11},
  {'text': 'EVERY', 'start': 56, 'end': 61, 'id': 12},
  {'text': 'continent', 'start': 62, 'end': 71, 'id': 13},
  {'text': '(', 'start': 72, 'end': 73, 'id': 14},
  {'text': 'except', 'start': 73, 'end': 79, 'id': 15},
  {'text': 'Antarctica', 'start': 80, 'end': 90, 'id': 16},
  {'text': ',', 'start': 90, 'end': 91, 'id': 17},
  {'text': "c'mon", 'start': 92, 'end': 97, 'id': 18},
  {'text': 'guys', 'start': 98, 'end': 102, 'id': 19},
  {'text': ')', 'start': 102, 'end': 103, 'id': 20},
  {'text': 'for', 'start': 104, 'end': 107, 'id': 21},
  {'text': 'the', 'start': 108, 'end': 111, 'id': 22},
  {'text': 'Global', 'start': 112, 'end': 118, 'id': 23},
  {'text': 'Day', 'start': 119, 'end': 122, 'id': 24},
  {'text': 'of', 'start': 123, 'end': 125, 'id': 25},
  {'text': 'Action', 'start': 126, 'end': 132, 'id': 26},
  {'text': 'for', 'start': 133, 'end': 136, 'id': 27},
  {'text': 'the', 'start': 137, 'end': 140, 'id': 28},
  {'text': 'Amazon', 'start': 141, 'end': 147, 'id': 29},
  {'text': 'on', 'start': 148, 'end': 150, 'id': 30},
  {'text': 'Sep', 'start': 151, 'end': 154, 'id': 31},
  {'text': '5th', 'start': 155, 'end': 158, 'id': 32},
  {'text': '!', 'start': 158, 'end': 159, 'id': 33}],
 'spans': [{'text': 'Amazon',
   'start': 141,
   'end': 147,
   'priority': 0.3181818182,
   'score': 0.3181818182,
   'pattern': -1172721813,
   'label': 'ORG',
   'answer': 'reject',
   'token_start': 29,
   'token_end': 29},
  {'text': 'Amazon',
   'start': 141,
   'end': 147,
   'priority': 0.5205479452,
   'score': 0.5205479452,
   'pattern': -1172721813,
   'label': 'ORG',
   'answer': 'reject',
   'token_start': 29,
   'token_end': 29}],
 'meta': {'score': 0.3181818182, 'pattern': 94},
 '_session_id': 'tracker_ner-default',
 '_view_id': 'ner',
 'answer': 'reject'}
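
For context, here is a minimal sketch of how I currently collapse such duplicate spans before scoring, keying them on (start, end, label) and keeping the first copy per example. The helper name dedupe_spans is just mine, and I don't know whether this matches what ner.batch-train does internally:

def dedupe_spans(example):
    # Collapse spans covering the same (start, end, label) into one.
    # The 'score'/'priority' values can differ between duplicates, so this
    # simply keeps the first copy seen per example.
    seen = {}
    for span in example.get("spans", []):
        key = (span["start"], span["end"], span["label"])
        if key not in seen:
            seen[key] = span
    return list(seen.values())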

I have tried to replicate @honnibal's code for computing precision, recall, etc., and I have also added an accuracy calculation. However, the accuracy I get does not match the accuracy reported by ner.batch-train, which produced the evaluate.jsonl file that is loaded into the dictionary_list variable in the code below (the model and file paths in the snippet are just placeholders):

import json
import spacy

# Model exported by ner.batch-train (path is a placeholder)
nlp = spacy.load("./ner-batch-train-model")

# Parsed lines of the evaluate.jsonl file
with open("evaluate.jsonl", encoding="utf8") as f:
    dictionary_list = [json.loads(line) for line in f]

tp = 0.0
fp = 0.0
tn = 0.0
fn = 0.0
ignored = 0.0
LABEL = "ORG"

for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        ignored += 1
    else:
        doc = nlp(eg['text'])

        # Entity spans the model predicts for this label
        guesses = set((ent.start_char, ent.end_char, ent.label_)
                      for ent in doc.ents if ent.label_ == LABEL)
        # Annotated spans that were accepted
        positives = set((span["start"], span["end"], span["label"])
                        for span in eg["spans"] if span['answer'] == 'accept')

        tp += len(guesses.intersection(positives))
        fn += len(positives - guesses)
        fp += len(guesses - positives)
        # Count a whole example as a true negative if nothing was predicted
        # and nothing was accepted
        if len(guesses) == 0 and len(positives) == 0:
            tn += 1

acc = tp / (tp + fn + fp + tn + 1e-10)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
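
For reference, my understanding of the textbook definition of accuracy would also count the true negatives in the numerator. I don't know whether that is what ner.batch-train reports, but it would be:

# Textbook accuracy: correct decisions over all decisions (not sure this is
# what ner.batch-train actually reports)
textbook_acc = (tp + tn) / (tp + tn + fp + fn + 1e-10)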

I am struggling to understand how Prodigy calculates accuracy given the information in the evaluate.jsonl file, and how it deals with conflicting classification scores for otherwise identical spans within an example. On top of replicating the accuracy score given by ner.batch-train, I would also like to properly calculate precision, recall, and F-score from the evaluate.jsonl file.
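
To make the question more concrete, below is a sketch of one per-annotation interpretation I considered, where every annotated span is scored once against the model's predictions: an accepted span the model also predicts counts as a true positive, a rejected span the model predicts counts as a false positive, and so on. This is only my guess at a scheme (the helper name score_per_annotation is mine), not necessarily what Prodigy does, and note that duplicated spans like the two rejected 'Amazon' spans above would be counted twice here, which is part of my confusion:

def score_per_annotation(examples, nlp, label="ORG"):
    # Score each annotated span once against the model's predictions.
    tp = fp = tn = fn = 0
    for eg in examples:
        if eg["answer"] == "ignore":
            continue
        doc = nlp(eg["text"])
        predicted = set((ent.start_char, ent.end_char, ent.label_)
                        for ent in doc.ents if ent.label_ == label)
        for span in eg.get("spans", []):
            key = (span["start"], span["end"], span["label"])
            if span.get("answer") == "accept":
                if key in predicted:
                    tp += 1
                else:
                    fn += 1
            elif span.get("answer") == "reject":
                if key in predicted:
                    fp += 1
                else:
                    tn += 1
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    return accuracy, tp, fp, tn, fn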

I've tried to reverse engineer how Prodigy calculates accuracy in ner.batch-train, but didn't get very far. I noticed that best_stats has an "unk" key. Is this perhaps used when there are conflicting classifications of the same span? Any help appreciated!