Thank you Ines and team - this is an amazing tool! I am encountering a strange issue after exporting an annotated dataset. I am wondering if I am doing something wrong or if this is a bug.
After running db-out, a significant number of annotated spans (half in my sample) are missing the "text" attribute in the exported JSONL. In the first span below, "Analyze" is the piece of text that was labeled as "Analyze_Review". The second and third spans have labels of "Analyze_Review" and "Communication_Confer_Personal", respectively, but they are missing the text field. I can likely grab the annotated text using token_start and token_end. But why is the text present in some records and not others? I am using a patterns file of ~200 patterns in case that makes any difference.
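For reference, this is roughly the workaround I have in mind: a minimal sketch that backfills the missing span text from the character offsets rather than the token indices. It assumes each span's "start" and "end" index directly into the record's top-level "text", which holds for the first span in my output. The helper name fill_span_text and the shortened record are mine for illustration and are not part of Prodigy.

```python
import json


def fill_span_text(example):
    """Return a copy of the example in which every span has a "text" key.

    Assumption: span["start"] and span["end"] are character offsets into
    example["text"]. Spans that already carry "text" are left unchanged.
    """
    text = example["text"]
    spans = []
    for span in example.get("spans", []):
        span = dict(span)  # copy so the input record is not mutated
        span.setdefault("text", text[span["start"]:span["end"]])
        spans.append(span)
    return {**example, "spans": spans}


# Shortened, hypothetical record modeled on the exported output.
record = {
    "text": "Analyze and review updated draft",
    "spans": [
        {"text": "Analyze", "start": 0, "end": 7, "label": "Analyze_Review"},
        {"start": 12, "end": 18, "label": "Analyze_Review"},
    ],
}

filled = fill_span_text(record)
print(json.dumps([span["text"] for span in filled["spans"]]))
```

The same function could be applied to each line of the exported file after parsing it with json.loads, but I would still like to understand why the field is missing in the first place.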
Thanks in advance for any insight.
Prodigy version: 1.11.7
Prodigy command: prodigy ner.manual activities_demo en_core_web_lg ./assets/activities.jsonl -pt ./assets/activity_patterns.jsonl --label ./assets/activity_labels.txt
Example output:
{
  "text": "Analyze and review updated draft of assignment agreement and related documents and emails to Bugs Bunny and Elmer Fudd.",
  "meta": {
    "id": "1234",
    "original_text": "Analyze and review updated draft of assignment agreement and related documents and emails to Bugs Bunny and Elmer Fudd.",
    "pattern": "0"
  },
  "_input_hash": -977724056,
  "_task_hash": -665543633,
  "_is_binary": false,
  "spans": [
    {
      "text": "Analyze",
      "start": 0,
      "end": 7,
      "pattern": -757930524,
      "token_start": 0,
      "token_end": 0,
      "label": "Analyze_Review"
    },
    {
      "start": 12,
      "end": 18,
      "token_start": 2,
      "token_end": 2,
      "label": "Analyze_Review"
    },
    {
      "start": 83,
      "end": 108,
      "token_start": 12,
      "token_end": 15,
      "label": "Communication_Confer_Personal"
    }
  ],
  "tokens": [
    { "text": "Analyze", "start": 0, "end": 7, "id": 0, "ws": true },
    { "text": "and", "start": 8, "end": 11, "id": 1, "ws": true },
    { "text": "review", "start": 12, "end": 18, "id": 2, "ws": true },
    { "text": "updated", "start": 19, "end": 26, "id": 3, "ws": true },
    { "text": "draft", "start": 27, "end": 32, "id": 4, "ws": true },
    { "text": "of", "start": 33, "end": 35, "id": 5, "ws": true },
    { "text": "assignment", "start": 36, "end": 46, "id": 6, "ws": true },
    { "text": "agreement", "start": 47, "end": 56, "id": 7, "ws": true },
    { "text": "and", "start": 57, "end": 60, "id": 8, "ws": true },
    { "text": "related", "start": 61, "end": 68, "id": 9, "ws": true },
    { "text": "documents", "start": 69, "end": 78, "id": 10, "ws": true },
    { "text": "and", "start": 79, "end": 82, "id": 11, "ws": true },
    { "text": "emails", "start": 83, "end": 89, "id": 12, "ws": true },
    { "text": "to", "start": 90, "end": 92, "id": 13, "ws": true },
    { "text": "Bugs", "start": 93, "end": 99, "id": 14, "ws": true },
    { "text": "Bunny", "start": 100, "end": 108, "id": 15, "ws": true },
    { "text": "and", "start": 109, "end": 112, "id": 16, "ws": true },
    { "text": "Elmer", "start": 113, "end": 117, "id": 17, "ws": true },
    { "text": "Fudd", "start": 118, "end": 125, "id": 18, "ws": false },
    { "text": ".", "start": 125, "end": 126, "id": 19, "ws": false }
  ],
  "_view_id": "ner_manual",
  "answer": "accept",
  "_timestamp": 1667993974,
  "_annotator_id": "annotator1",
  "_session_id": "annotator1"
}