Confusion about the structure of the relations JSONL format

I'm trying to understand the relations jsonl format. Here's an example:

{"text": "1/2 tsp. each: basil, salt and sugar", "spans": [{"start": 0, "end": 3, "label": "AMOUNT"}, {"start": 4, "end": 8, "label": "UNIT"}, {"start": 9, "end": 13, "label": "O"}, {"start": 13, "end": 14, "label": "O"}, {"start": 15, "end": 20, "label": "ING"}, {"start": 20, "end": 21, "label": "O"}, {"start": 22, "end": 26, "label": "ING"}, {"start": 27, "end": 30, "label": "AND"}, {"start": 31, "end": 36, "label": "ING"}], "relations": [{"child": 7, "head": 5, "label": "canding"}, {"child": 9, "head": 5, "label": "canding"}]}

The indexes for spans into the text are zero-based. In the examples I see in the documentation, relations are also zero-based. However, in this example, the head and child references seem to be 1-based. If I load this into Prodigy, it displays correctly. Is this a special case? Is a relations reference with no tokens field a 1-based reference into spans?

(I'm trying to create and read Prodigy-compatible data, which is why I need to understand the implementation details :slight_smile: )

Thanks,
Greg

Ah, I think I see: relations are against tokens, not spans. spaCy tokenizes the text if the tokens aren't present. If I have a different tokenizer, that's going to cause issues.

Hi @ggilley,
That's right. The child and head keys' values are references to tokens (using a zero-based index). Perhaps it's easier to understand with the complete representation, which you can obtain by feeding your example to Prodigy, saving it, and inspecting the saved example structure:

{
  "text": "1/2 tsp. each: basil, salt and sugar",
  "spans": [
    {
      "text": "1/2",
      "start": 0,
      "token_start": 0,
      "token_end": 0,
      "end": 3,
      "type": "span",
      "label": "AMOUNT"
    },
    {
      "text": "tsp.",
      "start": 4,
      "token_start": 1,
      "token_end": 2,
      "end": 8,
      "type": "span",
      "label": "UNIT"
    },
    {
      "text": "each",
      "start": 9,
      "token_start": 3,
      "token_end": 3,
      "end": 13,
      "type": "span",
      "label": "O"
    },
    {
      "text": ":",
      "start": 13,
      "token_start": 4,
      "token_end": 4,
      "end": 14,
      "type": "span",
      "label": "O"
    },
    {
      "text": "basil",
      "start": 15,
      "token_start": 5,
      "token_end": 5,
      "end": 20,
      "type": "span",
      "label": "ING"
    },
    {
      "text": ",",
      "start": 20,
      "token_start": 6,
      "token_end": 6,
      "end": 21,
      "type": "span",
      "label": "O"
    },
    {
      "text": "salt",
      "start": 22,
      "token_start": 7,
      "token_end": 7,
      "end": 26,
      "type": "span",
      "label": "ING"
    },
    {
      "text": "and",
      "start": 27,
      "token_start": 8,
      "token_end": 8,
      "end": 30,
      "type": "span",
      "label": "AND"
    },
    {
      "text": "sugar",
      "start": 31,
      "token_start": 9,
      "token_end": 9,
      "end": 36,
      "type": "span",
      "label": "ING"
    }
  ],
  "relations": [
    {
      "head": 5,
      "child": 7,
      "head_span": {
        "start": 15,
        "end": 20,
        "token_start": 5,
        "token_end": 5,
        "label": "ING"
      },
      "child_span": {
        "start": 22,
        "end": 26,
        "token_start": 7,
        "token_end": 7,
        "label": "ING"
      },
      "color": "#c5bdf4",
      "label": "canding"
    },
    {
      "head": 5,
      "child": 9,
      "head_span": {
        "start": 15,
        "end": 20,
        "token_start": 5,
        "token_end": 5,
        "label": "ING"
      },
      "child_span": {
        "start": 31,
        "end": 36,
        "token_start": 9,
        "token_end": 9,
        "label": "ING"
      },
      "color": "#c5bdf4",
      "label": "canding"
    }
  ],
  "_input_hash": -990481556,
  "_task_hash": -427806018,
  "_is_binary": false,
  "tokens": [
    {
      "text": "1/2",
      "start": 0,
      "end": 3,
      "id": 0,
      "ws": true,
      "disabled": false
    },
    {
      "text": "tsp",
      "start": 4,
      "end": 7,
      "id": 1,
      "ws": false,
      "disabled": false
    },
    {
      "text": ".",
      "start": 7,
      "end": 8,
      "id": 2,
      "ws": true,
      "disabled": false
    },
    {
      "text": "each",
      "start": 9,
      "end": 13,
      "id": 3,
      "ws": false,
      "disabled": false
    },
    {
      "text": ":",
      "start": 13,
      "end": 14,
      "id": 4,
      "ws": true,
      "disabled": false
    },
    {
      "text": "basil",
      "start": 15,
      "end": 20,
      "id": 5,
      "ws": false,
      "disabled": false
    },
    {
      "text": ",",
      "start": 20,
      "end": 21,
      "id": 6,
      "ws": true,
      "disabled": false
    },
    {
      "text": "salt",
      "start": 22,
      "end": 26,
      "id": 7,
      "ws": true,
      "disabled": false
    },
    {
      "text": "and",
      "start": 27,
      "end": 30,
      "id": 8,
      "ws": true,
      "disabled": false
    },
    {
      "text": "sugar",
      "start": 31,
      "end": 36,
      "id": 9,
      "ws": false,
      "disabled": false
    }
  ]
}
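To make the indexing concrete, here's a minimal sketch in plain Python (no Prodigy API involved; the relations.jsonl file name is just assumed) that resolves the head and child values against the tokens list:

```python
import json

# Assumed file name; one task per line, as in the example above.
with open("relations.jsonl", encoding="utf8") as f:
    example = json.loads(f.readline())

# Relations point at tokens by their zero-based "id".
token_text = {tok["id"]: tok["text"] for tok in example["tokens"]}

for rel in example["relations"]:
    head = token_text[rel["head"]]
    child = token_text[rel["child"]]
    print(f"{head} --{rel['label']}--> {child}")

# For the example above this prints:
#   basil --canding--> salt
#   basil --canding--> sugar
```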

Also, Prodigy is able to render pre-tokenized text, so it is not a problem to have custom tokenization. But, yes, you need to provide the tokens key and use it as the reference in your spans and relations.
For example:
By default, spaCy tokenizes on :, but I can provide my own custom tokens, where : should not be treated as a separate token:
{"text": "a:b c d", "tokens": [{"text": "a:b", "start": 0, "end": 3, "id": 0}, {"text": "c", "start": 4, "end": 5, "id": 1}, {"text": "d", "start": 6, "end": 7, "id": 2}], "spans": [{"start": 0, "end": 3, "label": "AB"}, {"start": 6, "end": 7, "label": "D"}], "relations": [{"child": 2, "head": 0, "label": "ABtoD"}]}
And that renders just fine.
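If you're generating Prodigy-compatible data yourself, the main thing is that each token's start/end offsets line up exactly with the text. Here's a small helper sketch (my own code, not a Prodigy API) that builds such a tokens list from pre-split words:

```python
def make_tokens(text, words):
    """Build a Prodigy-style tokens list from pre-split words,
    computing character offsets against the original text."""
    tokens, offset = [], 0
    for i, word in enumerate(words):
        start = text.index(word, offset)  # find each word left to right
        end = start + len(word)
        tokens.append({
            "text": word,
            "start": start,
            "end": end,
            "id": i,
            "ws": text[end:end + 1] == " ",  # whitespace follows this token?
        })
        offset = end
    return tokens

print(make_tokens("a:b c d", ["a:b", "c", "d"]))
# [{'text': 'a:b', 'start': 0, 'end': 3, 'id': 0, 'ws': True},
#  {'text': 'c', 'start': 4, 'end': 5, 'id': 1, 'ws': True},
#  {'text': 'd', 'start': 6, 'end': 7, 'id': 2, 'ws': False}]
```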
It's perhaps worth adding that one thing to be mindful of when using custom tokenization is to really double-check that the format is correct before annotating lots of examples. And remember that the same tokenizer must be used in training (sorry for stating the obvious, but I just wanted to stay on the safe side :P). The simplest way to go about it, if you are going to train with spaCy, would be to export your data from Prodigy with the data-to-spacy recipe and use a custom tokenizer component in the spaCy pipeline, as sketched below.
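For instance, if the annotations were produced with the whitespace-only tokenization above, wiring an equivalent tokenizer into a spaCy v3 pipeline could look like this (a sketch following spaCy's documented custom-tokenizer registration pattern; the registry name whitespace_tokenizer is just an assumption):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Split on single spaces only, so 'a:b' stays one token
    (a sketch; it doesn't handle runs of whitespace)."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)

@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer

# Then point the training config exported by data-to-spacy at it:
#   [nlp.tokenizer]
#   @tokenizers = "whitespace_tokenizer"
```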
