Hello, I am experiencing a similar problem. I also want to calculate accuracy, recall, etc. from the evaluate.jsonl file I am getting from ner.batch-train. The dataset I trained on was created using ner.teach, and it contains multiple spans referencing the same text, sometimes with conflicting model scores (see the example below):
{'text': "RT/SHARE We are SO THRILLED that actions are planned on EVERY continent (except Antarctica, c'mon guys) for the Global Day of Action for the Amazon on Sep 5th!",
'_input_hash': 953315024,
'_task_hash': -1148677878,
'tokens': [{'text': 'RT', 'start': 0, 'end': 2, 'id': 0},
{'text': '/', 'start': 2, 'end': 3, 'id': 1},
{'text': 'SHARE', 'start': 3, 'end': 8, 'id': 2},
{'text': 'We', 'start': 9, 'end': 11, 'id': 3},
{'text': 'are', 'start': 12, 'end': 15, 'id': 4},
{'text': 'SO', 'start': 16, 'end': 18, 'id': 5},
{'text': 'THRILLED', 'start': 19, 'end': 27, 'id': 6},
{'text': 'that', 'start': 28, 'end': 32, 'id': 7},
{'text': 'actions', 'start': 33, 'end': 40, 'id': 8},
{'text': 'are', 'start': 41, 'end': 44, 'id': 9},
{'text': 'planned', 'start': 45, 'end': 52, 'id': 10},
{'text': 'on', 'start': 53, 'end': 55, 'id': 11},
{'text': 'EVERY', 'start': 56, 'end': 61, 'id': 12},
{'text': 'continent', 'start': 62, 'end': 71, 'id': 13},
{'text': '(', 'start': 72, 'end': 73, 'id': 14},
{'text': 'except', 'start': 73, 'end': 79, 'id': 15},
{'text': 'Antarctica', 'start': 80, 'end': 90, 'id': 16},
{'text': ',', 'start': 90, 'end': 91, 'id': 17},
{'text': "c'mon", 'start': 92, 'end': 97, 'id': 18},
{'text': 'guys', 'start': 98, 'end': 102, 'id': 19},
{'text': ')', 'start': 102, 'end': 103, 'id': 20},
{'text': 'for', 'start': 104, 'end': 107, 'id': 21},
{'text': 'the', 'start': 108, 'end': 111, 'id': 22},
{'text': 'Global', 'start': 112, 'end': 118, 'id': 23},
{'text': 'Day', 'start': 119, 'end': 122, 'id': 24},
{'text': 'of', 'start': 123, 'end': 125, 'id': 25},
{'text': 'Action', 'start': 126, 'end': 132, 'id': 26},
{'text': 'for', 'start': 133, 'end': 136, 'id': 27},
{'text': 'the', 'start': 137, 'end': 140, 'id': 28},
{'text': 'Amazon', 'start': 141, 'end': 147, 'id': 29},
{'text': 'on', 'start': 148, 'end': 150, 'id': 30},
{'text': 'Sep', 'start': 151, 'end': 154, 'id': 31},
{'text': '5th', 'start': 155, 'end': 158, 'id': 32},
{'text': '!', 'start': 158, 'end': 159, 'id': 33}],
'spans': [{'text': 'Amazon',
'start': 141,
'end': 147,
'priority': 0.3181818182,
'score': 0.3181818182,
'pattern': -1172721813,
'label': 'ORG',
'answer': 'reject',
'token_start': 29,
'token_end': 29},
{'text': 'Amazon',
'start': 141,
'end': 147,
'priority': 0.5205479452,
'score': 0.5205479452,
'pattern': -1172721813,
'label': 'ORG',
'answer': 'reject',
'token_start': 29,
'token_end': 29}],
'meta': {'score': 0.3181818182, 'pattern': 94},
'_session_id': 'tracker_ner-default',
'_view_id': 'ner',
'answer': 'reject'}
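Note how the same 'Amazon' span appears twice with different scores but the same answer. One thing I've considered is collapsing such duplicates before scoring, e.g. keeping only the highest-scoring entry per (start, end, label), though I have no idea whether that is what ner.batch-train does internally. A rough sketch, using only the fields shown above:

def dedupe_spans(spans):
    # Keep one span per (start, end, label); when duplicates disagree on the
    # model score, keep the highest-scoring one (an arbitrary choice on my part).
    best = {}
    for span in spans:
        key = (span['start'], span['end'], span['label'])
        if key not in best or span.get('score', 0.0) > best[key].get('score', 0.0):
            best[key] = span
    return list(best.values())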
I have tried to replicate @honnibal's code for finding precision, recall, etc., and I have also added an accuracy calculation. However, my accuracy is not the same as the accuracy score given by ner.batch-train, which produced the evaluate.jsonl file that is the source for the dictionary_list variable in the code below.
import spacy
import srsly  # used here to read the JSONL file; any JSONL reader works

# Assumptions: "path/to/model" is the model exported by ner.batch-train,
# and evaluate.jsonl is the evaluation file it produced.
nlp = spacy.load("path/to/model")
dictionary_list = list(srsly.read_jsonl("evaluate.jsonl"))

tp = 0.0  # exact-match spans the model predicts and I accepted
fp = 0.0  # predicted spans that are not among the accepted spans
tn = 0.0  # examples with no accepted spans and no predictions
fn = 0.0  # accepted spans the model misses
ignored = 0.0
LABEL = "ORG"

for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        ignored += 1
    else:
        doc = nlp(eg['text'])
        # Model predictions for the target label as (start_char, end_char, label) tuples
        guesses = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents if ent.label_ == LABEL)
        # Gold spans: only the spans I explicitly accepted during annotation
        positives = set((span["start"], span["end"], span["label"]) for span in eg["spans"] if span['answer'] == 'accept')
        tp += len(guesses.intersection(positives))
        fn += len(positives - guesses)
        fp += len(guesses - positives)
        if len(guesses) == 0 and len(positives) == 0:
            tn += 1

# My attempt at an accuracy figure; it does not match what ner.batch-train reports.
acc = tp / (tp + fn + fp + tn + 1e-10)
precision = tp / (tp + fp + 1e-10)
recall = tp / (tp + fn + 1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)
I am struggling to understand how Prodigy calculates accuracy from the information in the evaluate.jsonl file, and how it deals with conflicting classification scores for otherwise identical spans within an example. On top of replicating the accuracy score given by ner.batch-train, I would also like to properly calculate precision, recall, and F-score from the evaluate.jsonl file.
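For comparison, here is a sketch of another scoring I've been considering: treat each annotated span individually and count the model as right whenever its prediction agrees with my accept/reject answer. This is only a guess at the kind of per-span bookkeeping batch-train might be doing (it reuses nlp and dictionary_list from the code above):

right = 0
wrong = 0
for eg in dictionary_list:
    if eg['answer'] == 'ignore':
        continue
    doc = nlp(eg['text'])
    predicted = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    for span in eg['spans']:
        # Does the model's prediction for this span agree with my answer?
        # Note: duplicated spans like the 'Amazon' example above get counted twice here.
        model_says_entity = (span['start'], span['end'], span['label']) in predicted
        annotator_accepted = span['answer'] == 'accept'
        if model_says_entity == annotator_accepted:
            right += 1
        else:
            wrong += 1

span_accuracy = right / (right + wrong + 1e-10)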
Beyond that, I've tried to reverse engineer how Prodigy calculates accuracy in ner.batch-train, but didn't get very far. I noticed that best_stats has an "unk" key. Is this perhaps for cases where there are conflicting classifications of the same span? Any help appreciated!