Hello,
I am trying to calculate accuracy, precision, recall, and F1 scores for a model created using `ner.batch-train` on data annotated with `ner.teach`. I am using the outputted evaluation.jsonl file, and I am a bit confused by the `spans` section of the dictionaries. Take this as an example:
```
{'text': 'if thats the case thats as true as me saying im richer than microsoft, sony, epic games, apple, samsung combined... wouldnt that be a life to live XD', '_input_hash': -1603676578, '_task_hash': 958550217, 'tokens': [{'text': 'if', 'start': 0, 'end': 2, 'id': 0}, {'text': 'that', 'start': 3, 'end': 7, 'id': 1}, {'text': 's', 'start': 7, 'end': 8, 'id': 2}, {'text': 'the', 'start': 9, 'end': 12, 'id': 3}, {'text': 'case', 'start': 13, 'end': 17, 'id': 4}, {'text': 'that', 'start': 18, 'end': 22, 'id': 5}, {'text': 's', 'start': 22, 'end': 23, 'id': 6}, {'text': 'as', 'start': 24, 'end': 26, 'id': 7}, {'text': 'true', 'start': 27, 'end': 31, 'id': 8}, {'text': 'as', 'start': 32, 'end': 34, 'id': 9}, {'text': 'me', 'start': 35, 'end': 37, 'id': 10}, {'text': 'saying', 'start': 38, 'end': 44, 'id': 11}, {'text': 'i', 'start': 45, 'end': 46, 'id': 12}, {'text': 'm', 'start': 46, 'end': 47, 'id': 13}, {'text': 'richer', 'start': 48, 'end': 54, 'id': 14}, {'text': 'than', 'start': 55, 'end': 59, 'id': 15}, {'text': 'microsoft', 'start': 60, 'end': 69, 'id': 16}, {'text': ',', 'start': 69, 'end': 70, 'id': 17}, {'text': 'sony', 'start': 71, 'end': 75, 'id': 18}, {'text': ',', 'start': 75, 'end': 76, 'id': 19}, {'text': 'epic', 'start': 77, 'end': 81, 'id': 20}, {'text': 'games', 'start': 82, 'end': 87, 'id': 21}, {'text': ',', 'start': 87, 'end': 88, 'id': 22}, {'text': 'apple', 'start': 89, 'end': 94, 'id': 23}, {'text': ',', 'start': 94, 'end': 95, 'id': 24}, {'text': 'samsung', 'start': 96, 'end': 103, 'id': 25}, {'text': 'combined', 'start': 104, 'end': 112, 'id': 26}, {'text': '...', 'start': 112, 'end': 115, 'id': 27}, {'text': 'would', 'start': 116, 'end': 121, 'id': 28}, {'text': 'nt', 'start': 121, 'end': 123, 'id': 29}, {'text': 'that', 'start': 124, 'end': 128, 'id': 30}, {'text': 'be', 'start': 129, 'end': 131, 'id': 31}, {'text': 'a', 'start': 132, 'end': 133, 'id': 32}, {'text': 'life', 'start': 134, 'end': 138, 'id': 33}, {'text': 'to', 'start': 139, 'end': 141, 'id': 34}, {'text': 'live', 'start': 142, 'end': 146, 'id': 35}, {'text': 'XD', 'start': 147, 'end': 149, 'id': 36}], 'spans': [{'text': 'sony', 'start': 71, 'end': 75, 'priority': 0.5, 'score': 0.5, 'pattern': 422137312, 'label': 'ORG', 'answer': 'accept', 'token_start': 18, 'token_end': 18}, {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.4444444444, 'score': 0.4444444444, 'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16}, {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.5555555556, 'score': 0.5555555556, 'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16}, {'text': 'microsoft', 'start': 60, 'end': 69, 'priority': 0.5, 'score': 0.5, 'pattern': -1043368242, 'label': 'ORG', 'answer': 'accept', 'token_start': 16, 'token_end': 16}], 'meta': {'score': 0.5, 'pattern': 75}, '_session_id': 'tracker_ner-default', '_view_id': 'ner', 'answer': 'accept'}
```
My questions are:

- Why does 'microsoft' appear multiple times in `spans` with the exact same information except different `score` values?
- How do you go about calculating accuracy, precision, etc. from this information? What is the cut-off used for the `score` variable (e.g., < 0.5, <= 0.5, etc.)? Of course, I am aiming to have my accuracy score match the one printed by the `prodigy ner.batch-train` command.
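For context, here is roughly how I was planning to compute a precision figure from the binary accept/reject decisions in the file. This is just my own sketch, not Prodigy code: the `dedupe_spans` and `binary_precision` helpers are names I made up, and collapsing duplicate spans to the highest-scoring copy is an assumption on my part.

```python
import json


def dedupe_spans(spans):
    """Collapse spans that differ only in score/priority/pattern,
    keeping one entry per (start, end, label) key.
    Assumption: the highest-scoring duplicate is the one to keep."""
    seen = {}
    for span in spans:
        key = (span["start"], span["end"], span["label"])
        if key not in seen or span.get("score", 0) > seen[key].get("score", 0):
            seen[key] = span
    return list(seen.values())


def binary_precision(examples):
    """Rough precision over binary span decisions:
    accepted suggestions / (accepted + rejected suggestions)."""
    accepted = rejected = 0
    for eg in examples:
        for span in dedupe_spans(eg.get("spans", [])):
            if span.get("answer") == "accept":
                accepted += 1
            elif span.get("answer") == "reject":
                rejected += 1
    total = accepted + rejected
    return accepted / total if total else 0.0


# Usage (assuming evaluation.jsonl has one JSON dict per line):
# examples = [json.loads(line) for line in open("evaluation.jsonl")]
# print(binary_precision(examples))
```

I'm not sure this matches what `ner.batch-train` reports internally, which is part of what I'm asking.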
Thanks!
PS: the formatting on the dictionary is a bit ugly; here is a photo that might make things clearer: