Hi. I'm having two issues with the output of db-out. When I export my annotations, some examples have empty spans and others have spans with no 'text' key. When I used db-out before, spans was either an empty list ([]) or a list of dictionaries that each had a 'text' key. I need the 'text' key to know which part of the input text each span refers to. Here is a minimal working example that reproduces the problem.
First, here is the command I use to start annotations:
python3 -m prodigy ner.manual identify_dosage_non_dosage_validate_data_SB2 en_core_web_lg validate_data_dosage_annotations_SB2.jsonl --label non_dosage,dosage
which reads this input file:
validate_data_dosage_annotations_SB2.jsonl (52.3 KB)
Next, I annotate the examples, labeling spans of text as dosage or non_dosage.
Then I save the annotations with
python3 -m prodigy db-out identify_dosage_non_dosage_validate_data_SB2 > validate_data_dosage_non_dosage_annotations_SB2.jsonl
The annotations are saved here:
validate_data_dosage_non_dosage_annotations_SB2.jsonl (53.1 KB)
To view the annotations in a spreadsheet, I convert the exported JSONL file to CSV with this Python script:
import pandas as pd
# Read the db-out export (one JSON object per line) into a DataFrame
df_jsonl_annotations = pd.read_json('validate_data_dosage_non_dosage_annotations_SB2.jsonl', lines=True)
# Write the DataFrame to CSV so it can be opened in a spreadsheet
df_jsonl_annotations.to_csv('validate_data_dosage_non_dosage_annotations_SB2.csv', index=False)
The results are shown below (originally a CSV file, which I have truncated for readability):
text _input_hash _task_hash _is_binary tokens _view_id answer _timestamp spans
January 9 - 241 6 - 375mg split into 3 doses. 96m deadlift/back/shoulder session. 30m cardio. 7,872 steps. 1,640 calories at 17g (7g net) carbs, 93g fat, 128g protein. 1 5g water January 10 - 241 4 - 375mg 1201376478 -478339982 FALSE [{'text': 'January', 'start': 0, 'end': 7, 'id': 0, 'ws': True}, {'text': '9', 'start': 8, 'end': 9, 'id': 1, 'ws': True}, {'text': '-', 'start': 10, 'end': 11, 'id': 2, 'ws': True} ner_manual accept 1673316381 [{'start': 18, 'end': 25, 'token_start': 5, 'token_end': 7, 'label': 'non_dosage'}, {'start': 125, 'end': 143, 'token_start': 32, 'token_end': 39, 'label': 'non_dosage'}
Originally Posted by itismethebeeFirst off, I turned 18 this year. -681172102 1046073490 FALSE [{'text': ' ', 'start': 0, 'end': 1, 'id': 0, 'ws': False}, {'text': 'Originally', 'start': 1, 'end': 11, 'id': 1, 'ws': True}, {'text': 'Posted', 'start': 12, 'end': 18, 'id': 2, 'ws': True}
d': 484 'ws': True} {'text': 'this' 'start': 2191 'end': 2195 'id': 485 'ws': True} {'text': 'post'
As you can see, the spans in the first row have no 'text' key, while the second row has empty spans. I don't think either should be the case.
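As a possible workaround on my side: each span still carries 'start' and 'end' character offsets, so I assume the span text can be recovered by slicing the example's 'text' with them. Here is a minimal sketch of that idea. The helper names add_span_text and load_with_span_text are my own and not part of Prodigy, and the sample record is made up for illustration.

```python
import json


def add_span_text(example):
    """Return a copy of one exported record with 'text' filled in on each span.

    Assumes 'start' and 'end' are character offsets into example['text'].
    """
    text = example["text"]
    spans = []
    # A missing or null 'spans' field is treated as "no spans".
    for span in example.get("spans") or []:
        span = dict(span)  # copy so the original record is left untouched
        span.setdefault("text", text[span["start"]:span["end"]])
        spans.append(span)
    return {**example, "spans": spans}


def load_with_span_text(path):
    """Read a JSONL export (one JSON object per line) and patch every record."""
    with open(path, encoding="utf-8") as f:
        return [add_span_text(json.loads(line)) for line in f if line.strip()]


# Made-up record, only to show the slicing.
example = {
    "text": "Take 375mg split into 3 doses.",
    "spans": [{"start": 5, "end": 10, "label": "dosage"}],
}
print(add_span_text(example)["spans"])
```

This would let me keep using the exported file for now, but I would still like to understand why the 'text' key is missing from the spans in the first place.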