Exported annotations missing text

Thank you Ines and team - this is an amazing tool! I am encountering a strange issue after exporting an annotated dataset. I am wondering if I am doing something wrong or if this is a bug.

After running db-out, a significant number of annotated spans (about half in my sample) are missing the "text" attribute in the exported JSONL. In the first span below, "Analyze" is the piece of text that was labeled as "Analyze_Review". The second and third spans have labels of "Analyze_Review" and "Communication_Confer_Personal", respectively, but they are missing the "text" field. I can likely grab the annotated text using token_start and token_end, but why is the text present in some records and not others? I am using a patterns file of ~200 patterns, in case that makes any difference.
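
In case it helps, this is roughly how I was planning to fill the missing text back in (untested, and the file name is just a placeholder):

import srsly  # installed with Prodigy

examples = list(srsly.read_jsonl("./activities_export.jsonl"))  # placeholder db-out export

for eg in examples:
    tokens = eg["tokens"]
    for span in eg.get("spans", []):
        if "text" not in span:
            # use the character offsets of the span's first and last tokens
            start_char = tokens[span["token_start"]]["start"]
            end_char = tokens[span["token_end"]]["end"]
            span["text"] = eg["text"][start_char:end_char]
            # equivalently: eg["text"][span["start"]:span["end"]]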

Thanks in advance for any insight.

Prodigy version: 1.11.7
Prodigy command: prodigy ner.manual activities_demo en_core_web_lg ./assets/activities.jsonl -pt ./assets/activity_patterns.jsonl --label ./assets/activity_labels.txt
Example output:

{
  "text": "Analyze and review updated draft of assignment agreement and related documents and emails to Bugs Bunny and Elmer Fudd.",
  "meta": {
    "id": "1234",
    "original_text": "Analyze and review updated draft of assignment agreement and related documents and emails to Bugs Bunny and Elmer Fudd.",
    "pattern": "0"
  },
  "_input_hash": -977724056,
  "_task_hash": -665543633,
  "_is_binary": false,
  "spans": [
    {
      "text": "Analyze",
      "start": 0,
      "end": 7,
      "pattern": -757930524,
      "token_start": 0,
      "token_end": 0,
      "label": "Analyze_Review"
    },
    { 
      "start": 12,
      "end": 18,
      "token_start": 2,
      "token_end": 2,
      "label": "Analyze_Review"
    },
    {
      "start": 83,
      "end": 108,
      "token_start": 12,
      "token_end": 15,
      "label": "Communication_Confer_Personal"
    }
  ],
  "tokens": [
    { "text": "Analyze", "start": 0, "end": 7, "id": 0, "ws": true },
    { "text": "and", "start": 8, "end": 11, "id": 1, "ws": true },
    { "text": "review", "start": 12, "end": 18, "id": 2, "ws": true },
    { "text": "updated", "start": 19, "end": 26, "id": 3, "ws": true },
    { "text": "draft", "start": 27, "end": 32, "id": 4, "ws": true },
    { "text": "of", "start": 33, "end": 35, "id": 5, "ws": true },
    { "text": "assignment", "start": 36, "end": 46, "id": 6, "ws": true },
    { "text": "agreement", "start": 47, "end": 56, "id": 7, "ws": true },
    { "text": "and", "start": 57, "end": 60, "id": 8, "ws": true },
    { "text": "related", "start": 61, "end": 68, "id": 9, "ws": true },
    { "text": "documents", "start": 69, "end": 78, "id": 10, "ws": true },
    { "text": "and", "start": 79, "end": 82, "id": 11, "ws": true },
    { "text": "emails", "start": 83, "end": 89, "id": 12, "ws": true },
    { "text": "to", "start": 90, "end": 92, "id": 13, "ws": true },
    { "text": "Bugs", "start": 93, "end": 99, "id": 14, "ws": true },
    { "text": "Bunny", "start": 100, "end": 108, "id": 15, "ws": true },
    { "text": "and", "start": 109, "end": 112, "id": 16, "ws": true },
    { "text": "Elmer", "start": 113, "end": 117, "id": 17, "ws": true },
    { "text": "Fudd", "start": 118, "end": 125, "id": 18, "ws": false },
    { "text": ".", "start": 125, "end": 126, "id": 19, "ws": false }
  ],
  "_view_id": "ner_manual",
  "answer": "accept",
  "_timestamp": 1667993974,
  "_annotator_id": "annotator1",
  "_session_id": "annotator1"
}

Hi @connor!

Thanks for your message and welcome to the Prodigy community :wave:

It looks like the ner_manual annotation interface doesn't capture "text" for annotated spans by default.

We have an Annotation Interfaces docs page that lists all of the interfaces and shows an example of what the output will look like.

For example, for ner_manual, the default output looks like this:

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}

Notice that the span doesn't have a "text" field. You only get "text" on spans that were pre-filled by a pattern match (in your example, those spans also carry an additional "pattern" key indicating which pattern matched).

I'm not 100% sure why, but I suspect it's because the raw span "text" isn't needed for training in spaCy: the character offsets are enough. Leaving it out keeps the export smaller (e.g., a really large corpus with many annotated spans could otherwise inflate in size).
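
For what it's worth, if you do need spaCy training data from the export, the data-to-spacy recipe handles the conversion for you. But as a rough sketch of doing it by hand (assuming a blank English pipeline and a placeholder file name), you can see that only the character offsets and labels ever get used:

import spacy
import srsly
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

for eg in srsly.read_jsonl("./activities_export.jsonl"):  # placeholder path
    if eg.get("answer") != "accept":
        continue
    doc = nlp(eg["text"])
    ents = []
    for span in eg.get("spans", []):
        # the character offsets and label are all that's needed; the span "text" is redundant
        ent = doc.char_span(span["start"], span["end"], label=span["label"], alignment_mode="expand")
        if ent is not None:
            ents.append(ent)
    doc.ents = ents
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")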

Hopefully that helps to confirm that this is intended behavior.

Also, note this point from the ner_manual interface docs:

Note that the "token_end" value of the spans is inclusive and not exclusive (like spaCy’s token indices for Span objects or list indices in Python). So a span with start 5 and end 6 will include the tokens 5 and 6 and the token span in spaCy would be doc[token_start : token_end + 1]. We’re hoping to make this consistent in the future, but it’d be a breaking change and require a new version of Prodigy’s data format.

This tricked me too at some point, so it's important to be aware of it if you switch between spaCy and Prodigy. Hopefully this prevents confusion down the road if you manually reconstruct the span text.
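
To make that concrete, here's a small sketch using the MacBook example above (assuming you load the same pipeline you annotated with, so the token boundaries line up):

import spacy

nlp = spacy.load("en_core_web_lg")  # same pipeline used during annotation

eg = {
    "text": "First look at the new MacBook Pro",
    "spans": [{"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}],
}

doc = nlp(eg["text"])
for span in eg["spans"]:
    # Prodigy's token_end is inclusive, so add 1 when slicing the spaCy Doc
    token_span = doc[span["token_start"] : span["token_end"] + 1]
    # character offsets are exclusive, like normal Python slices
    char_span = doc.char_span(span["start"], span["end"], label=span["label"])
    print(token_span.text, "|", char_span.text)  # both print "MacBook Pro"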

Thank you @ryanwesslen! I noticed the text was only present in instances where a pattern matched. I really appreciate the thorough explanation.