Hi @ggilley

That's right. The `child` and `head` keys' values are references to tokens (using zero-based indices). Perhaps it's easier to understand with the complete representation, which you can obtain by feeding your example to Prodigy, saving it, and inspecting the saved example structure:
{
"text": "1/2 tsp. each: basil, salt and sugar",
"spans": [
{
"text": "1/2",
"start": 0,
"token_start": 0,
"token_end": 0,
"end": 3,
"type": "span",
"label": "AMOUNT"
},
{
"text": "tsp.",
"start": 4,
"token_start": 1,
"token_end": 2,
"end": 8,
"type": "span",
"label": "UNIT"
},
{
"text": "each",
"start": 9,
"token_start": 3,
"token_end": 3,
"end": 13,
"type": "span",
"label": "O"
},
{
"text": ":",
"start": 13,
"token_start": 4,
"token_end": 4,
"end": 14,
"type": "span",
"label": "O"
},
{
"text": "basil",
"start": 15,
"token_start": 5,
"token_end": 5,
"end": 20,
"type": "span",
"label": "ING"
},
{
"text": ",",
"start": 20,
"token_start": 6,
"token_end": 6,
"end": 21,
"type": "span",
"label": "O"
},
{
"text": "salt",
"start": 22,
"token_start": 7,
"token_end": 7,
"end": 26,
"type": "span",
"label": "ING"
},
{
"text": "and",
"start": 27,
"token_start": 8,
"token_end": 8,
"end": 30,
"type": "span",
"label": "AND"
},
{
"text": "sugar",
"start": 31,
"token_start": 9,
"token_end": 9,
"end": 36,
"type": "span",
"label": "ING"
}
],
"relations": [
{
"head": 5,
"child": 7,
"head_span": {
"start": 15,
"end": 20,
"token_start": 5,
"token_end": 5,
"label": "ING"
},
"child_span": {
"start": 22,
"end": 26,
"token_start": 7,
"token_end": 7,
"label": "ING"
},
"color": "#c5bdf4",
"label": "canding"
},
{
"head": 5,
"child": 9,
"head_span": {
"start": 15,
"end": 20,
"token_start": 5,
"token_end": 5,
"label": "ING"
},
"child_span": {
"start": 31,
"end": 36,
"token_start": 9,
"token_end": 9,
"label": "ING"
},
"color": "#c5bdf4",
"label": "canding"
}
],
"_input_hash": -990481556,
"_task_hash": -427806018,
"_is_binary": false,
"tokens": [
{
"text": "1/2",
"start": 0,
"end": 3,
"id": 0,
"ws": true,
"disabled": false
},
{
"text": "tsp",
"start": 4,
"end": 7,
"id": 1,
"ws": false,
"disabled": false
},
{
"text": ".",
"start": 7,
"end": 8,
"id": 2,
"ws": true,
"disabled": false
},
{
"text": "each",
"start": 9,
"end": 13,
"id": 3,
"ws": false,
"disabled": false
},
{
"text": ":",
"start": 13,
"end": 14,
"id": 4,
"ws": true,
"disabled": false
},
{
"text": "basil",
"start": 15,
"end": 20,
"id": 5,
"ws": false,
"disabled": false
},
{
"text": ",",
"start": 20,
"end": 21,
"id": 6,
"ws": true,
"disabled": false
},
{
"text": "salt",
"start": 22,
"end": 26,
"id": 7,
"ws": true,
"disabled": false
},
{
"text": "and",
"start": 27,
"end": 30,
"id": 8,
"ws": true,
"disabled": false
},
{
"text": "sugar",
"start": 31,
"end": 36,
"id": 9,
"ws": false,
"disabled": false
}
]
}
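So, for instance, `"head": 5` and `"child": 7` in the first relation point at `tokens[5]` ("basil") and `tokens[7]` ("salt"). A quick way to sanity-check that mapping (just a throwaway helper of mine, nothing Prodigy-specific):

```python
def print_relations(task):
    # Resolve the zero-based head/child token ids against the "tokens" list
    for rel in task["relations"]:
        head = task["tokens"][rel["head"]]["text"]
        child = task["tokens"][rel["child"]]["text"]
        print(f"{head} --{rel['label']}--> {child}")

# For the saved example above this prints:
#   basil --canding--> salt
#   basil --canding--> sugar
```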
Also, Prodigy is able to render pre-tokenized text, so it's not a problem to have custom tokenization. But, yeah, you need to provide the `tokens` key and reference it in your `spans` and `relations`.
For example: by default spaCy tokenizes on `:`, but I can provide my custom tokens, where `:` should not be treated as a separate token:
{"text": "a:b c d", "tokens": [{"text": "a:b", "start": 0, "end": 3, "id": 0},{"text": "c", "start": 3, "end": 4, "id": 1},{"text": "d", "start": 4, "end": 5, "id": 3}], "spans": [{"start": 0, "end": 3, "label": "AB"}, {"start": 4, "end": 5, "label": "D"}], "relations": [{"child": 2, "head": 0, "label": "ABtoD"}]}
And that would render just fine.
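If you want to generate that `tokens` list programmatically rather than writing it by hand, here's a minimal sketch (my own helper, not a built-in Prodigy function) that builds it from a spaCy `Doc`; if you tokenize with something else, you just need to emit the same fields (`text`, `start`, `end`, `id`, and optionally `ws`):

```python
import json
import spacy

nlp = spacy.blank("en")  # or your pipeline with a customized tokenizer

def to_prodigy_tokens(text):
    # Minimal sketch: turn a text into Prodigy's pre-tokenized "tokens" format.
    doc = nlp(text)
    return [
        {
            "text": token.text,
            "start": token.idx,                  # character offset of token start
            "end": token.idx + len(token.text),  # character offset of token end
            "id": token.i,                       # zero-based token index
            "ws": bool(token.whitespace_),       # is the token followed by whitespace?
        }
        for token in doc
    ]

task = {"text": "1/2 tsp. each: basil, salt and sugar"}
task["tokens"] = to_prodigy_tokens(task["text"])
print(json.dumps(task))  # one JSONL line you can feed to Prodigy
```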
It's perhaps worth adding that one thing to be mindful about when using custom tokenization is to really double-check that the format is correct before annotating lots of examples. Also remember that the same tokenizer must be used in training (sorry for stating the obvious, but I just wanted to stay on the safe side :P). The simplest way to do that, if you are going to train with spaCy, is to export your data from Prodigy with the data-to-spacy recipe and use a custom tokenizer component in the spaCy pipeline.
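For the training side, something along these lines should work in spaCy v3 (the registry name `customize_tokenizer` and the crude colon-filtering logic are just illustrative assumptions, adapt them to whatever your custom tokenization actually does):

```python
import spacy
from spacy.util import compile_infix_regex

# Illustrative sketch: register a tokenizer that does not split on ":",
# so training uses the same tokenization as annotation.
@spacy.registry.tokenizers("customize_tokenizer")
def create_customize_tokenizer():
    def create_tokenizer(nlp):
        base = spacy.blank("en")
        # drop the default infix patterns that mention ":" (crude, for illustration)
        infixes = [p for p in base.Defaults.infixes if ":" not in p]
        tokenizer = base.tokenizer
        tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
        return tokenizer
    return create_tokenizer

# Then reference it in the training config produced by data-to-spacy:
#
# [nlp.tokenizer]
# @tokenizers = "customize_tokenizer"
```

When you run `spacy train`, pass the file that registers the tokenizer via `--code` so the registry entry can be resolved.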