Confusion about the structure of the relations JSONL format

I'm trying to understand the relations jsonl format. Here's an example:

{"text": "1/2 tsp. each: basil, salt and sugar", "spans": [{"start": 0, "end": 3, "label": "AMOUNT"}, {"start": 4, "end": 8, "label": "UNIT"}, {"start": 9, "end": 13, "label": "O"}, {"start": 13, "end": 14, "label": "O"}, {"start": 15, "end": 20, "label": "ING"}, {"start": 20, "end": 21, "label": "O"}, {"start": 22, "end": 26, "label": "ING"}, {"start": 27, "end": 30, "label": "AND"}, {"start": 31, "end": 36, "label": "ING"}], "relations": [{"child": 7, "head": 5, "label": "canding"}, {"child": 9, "head": 5, "label": "canding"}]}

The indexes for spans into the text are zero-based. In the examples I see in the documentation, relations are also zero-based. However, in this example, the head and child references seem to be 1-based. If I load this into Prodigy, it displays correctly. Is this a special case? Is a relations reference with no tokens field a 1-based reference into spans?

(I'm trying to create and read Prodigy-compatible data, which is why I need to understand the implementation details :slight_smile: )

Thanks,
Greg

Ah, I think I see: relations are against tokens, not spans. spaCy tokenizes the text if the tokens aren't present. If I have a different tokenizer, that's going to cause issues.

Hi @ggilley,
That's right. The child and head keys' values are references to tokens (using a zero-based index). Perhaps it's easier to understand with the complete representation, which you can obtain by feeding your example to Prodigy, saving it, and inspecting the saved example structure:

{
  "text": "1/2 tsp. each: basil, salt and sugar",
  "spans": [
    {
      "text": "1/2",
      "start": 0,
      "token_start": 0,
      "token_end": 0,
      "end": 3,
      "type": "span",
      "label": "AMOUNT"
    },
    {
      "text": "tsp.",
      "start": 4,
      "token_start": 1,
      "token_end": 2,
      "end": 8,
      "type": "span",
      "label": "UNIT"
    },
    {
      "text": "each",
      "start": 9,
      "token_start": 3,
      "token_end": 3,
      "end": 13,
      "type": "span",
      "label": "O"
    },
    {
      "text": ":",
      "start": 13,
      "token_start": 4,
      "token_end": 4,
      "end": 14,
      "type": "span",
      "label": "O"
    },
    {
      "text": "basil",
      "start": 15,
      "token_start": 5,
      "token_end": 5,
      "end": 20,
      "type": "span",
      "label": "ING"
    },
    {
      "text": ",",
      "start": 20,
      "token_start": 6,
      "token_end": 6,
      "end": 21,
      "type": "span",
      "label": "O"
    },
    {
      "text": "salt",
      "start": 22,
      "token_start": 7,
      "token_end": 7,
      "end": 26,
      "type": "span",
      "label": "ING"
    },
    {
      "text": "and",
      "start": 27,
      "token_start": 8,
      "token_end": 8,
      "end": 30,
      "type": "span",
      "label": "AND"
    },
    {
      "text": "sugar",
      "start": 31,
      "token_start": 9,
      "token_end": 9,
      "end": 36,
      "type": "span",
      "label": "ING"
    }
  ],
  "relations": [
    {
      "head": 5,
      "child": 7,
      "head_span": {
        "start": 15,
        "end": 20,
        "token_start": 5,
        "token_end": 5,
        "label": "ING"
      },
      "child_span": {
        "start": 22,
        "end": 26,
        "token_start": 7,
        "token_end": 7,
        "label": "ING"
      },
      "color": "#c5bdf4",
      "label": "canding"
    },
    {
      "head": 5,
      "child": 9,
      "head_span": {
        "start": 15,
        "end": 20,
        "token_start": 5,
        "token_end": 5,
        "label": "ING"
      },
      "child_span": {
        "start": 31,
        "end": 36,
        "token_start": 9,
        "token_end": 9,
        "label": "ING"
      },
      "color": "#c5bdf4",
      "label": "canding"
    }
  ],
  "_input_hash": -990481556,
  "_task_hash": -427806018,
  "_is_binary": false,
  "tokens": [
    {
      "text": "1/2",
      "start": 0,
      "end": 3,
      "id": 0,
      "ws": true,
      "disabled": false
    },
    {
      "text": "tsp",
      "start": 4,
      "end": 7,
      "id": 1,
      "ws": false,
      "disabled": false
    },
    {
      "text": ".",
      "start": 7,
      "end": 8,
      "id": 2,
      "ws": true,
      "disabled": false
    },
    {
      "text": "each",
      "start": 9,
      "end": 13,
      "id": 3,
      "ws": false,
      "disabled": false
    },
    {
      "text": ":",
      "start": 13,
      "end": 14,
      "id": 4,
      "ws": true,
      "disabled": false
    },
    {
      "text": "basil",
      "start": 15,
      "end": 20,
      "id": 5,
      "ws": false,
      "disabled": false
    },
    {
      "text": ",",
      "start": 20,
      "end": 21,
      "id": 6,
      "ws": true,
      "disabled": false
    },
    {
      "text": "salt",
      "start": 22,
      "end": 26,
      "id": 7,
      "ws": true,
      "disabled": false
    },
    {
      "text": "and",
      "start": 27,
      "end": 30,
      "id": 8,
      "ws": true,
      "disabled": false
    },
    {
      "text": "sugar",
      "start": 31,
      "end": 36,
      "id": 9,
      "ws": false,
      "disabled": false
    }
  ]
}
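To make the indexing concrete, here's a minimal sketch in plain Python (no Prodigy API involved; the relations.jsonl file name is just assumed) that resolves the head and child values against the tokens list:

```python
import json

# Assumed file name; one task per line, as in the example above.
with open("relations.jsonl", encoding="utf8") as f:
    example = json.loads(f.readline())

# Relations point at tokens by their zero-based "id".
token_text = {tok["id"]: tok["text"] for tok in example["tokens"]}

for rel in example["relations"]:
    head = token_text[rel["head"]]
    child = token_text[rel["child"]]
    print(f"{head} --{rel['label']}--> {child}")

# For the example above this prints:
#   basil --canding--> salt
#   basil --canding--> sugar
```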

Also, Prodigy is able to render pre-tokenized text, so it is not a problem to have custom tokenization. But, yes, you need to provide the tokens key and use it as the reference in your spans and relations.
For example:
By default, spaCy tokenizes on :, but I can provide my own custom tokens, where : should not be treated as a separate token:
{"text": "a:b c d", "tokens": [{"text": "a:b", "start": 0, "end": 3, "id": 0}, {"text": "c", "start": 4, "end": 5, "id": 1}, {"text": "d", "start": 6, "end": 7, "id": 2}], "spans": [{"start": 0, "end": 3, "label": "AB"}, {"start": 6, "end": 7, "label": "D"}], "relations": [{"child": 2, "head": 0, "label": "ABtoD"}]}
And that renders just fine.
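If you're generating Prodigy-compatible data yourself, the main thing is that each token's start/end offsets line up exactly with the text. Here's a small helper sketch (my own code, not a Prodigy API) that builds such a tokens list from pre-split words:

```python
def make_tokens(text, words):
    """Build a Prodigy-style tokens list from pre-split words,
    computing character offsets against the original text."""
    tokens, offset = [], 0
    for i, word in enumerate(words):
        start = text.index(word, offset)  # find each word left to right
        end = start + len(word)
        tokens.append({
            "text": word,
            "start": start,
            "end": end,
            "id": i,
            "ws": text[end:end + 1] == " ",  # whitespace follows this token?
        })
        offset = end
    return tokens

print(make_tokens("a:b c d", ["a:b", "c", "d"]))
# [{'text': 'a:b', 'start': 0, 'end': 3, 'id': 0, 'ws': True},
#  {'text': 'c', 'start': 4, 'end': 5, 'id': 1, 'ws': True},
#  {'text': 'd', 'start': 6, 'end': 7, 'id': 2, 'ws': False}]
```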
It's perhaps worth adding that one thing to be mindful of when using custom tokenization is to really double-check that the format is correct before annotating lots of examples. And remember that the same tokenizer must be used in training (sorry for stating the obvious, but I just wanted to stay on the safe side :P). The simplest way to go about it, if you are going to train with spaCy, would be to export your data from Prodigy with the data-to-spacy recipe and use a custom tokenizer component in the spaCy pipeline, as sketched below.
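For instance, if the annotations were produced with the whitespace-only tokenization above, wiring an equivalent tokenizer into a spaCy v3 pipeline could look like this (a sketch following spaCy's documented custom-tokenizer registration pattern; the registry name whitespace_tokenizer is just an assumption):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Split on single spaces only, so 'a:b' stays one token
    (a sketch; it doesn't handle runs of whitespace)."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)

@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer

# Then point the training config exported by data-to-spacy at it:
#   [nlp.tokenizer]
#   @tokenizers = "whitespace_tokenizer"
```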
