Prodigy tokenizing even when not supposed to?

henrye · August 16, 2019, 3:28am

I'm using ner.manual to fix some automatic annotations and have noticed that Prodigy sometimes highlights the wrong token. This seems to happen because it re-tokenizes the sentence even when the tokens are already provided. It's not a major hassle, but I'm curious whether I'm doing anything to cause this bug. Here's an example task:

{
    "text": "we have been living in italy for the past few years , where mexican food is non-existent , so we were looking forward to eating here . ",
    "meta": {
        "food": "non-existent"
    },
    "tokens": [
        {
            "text": "we",
            "start": 0,
            "end": 2,
            "id": 0
        },
        {
            "text": "have",
            "start": 3,
            "end": 7,
            "id": 1
        },
        {
            "text": "been",
            "start": 8,
            "end": 12,
            "id": 2
        },
        {
            "text": "living",
            "start": 13,
            "end": 19,
            "id": 3
        },
        {
            "text": "in",
            "start": 20,
            "end": 22,
            "id": 4
        },
        {
            "text": "italy",
            "start": 23,
            "end": 28,
            "id": 5
        },
        {
            "text": "for",
            "start": 29,
            "end": 32,
            "id": 6
        },
        {
            "text": "the",
            "start": 33,
            "end": 36,
            "id": 7
        },
        {
            "text": "past",
            "start": 37,
            "end": 41,
            "id": 8
        },
        {
            "text": "few",
            "start": 42,
            "end": 45,
            "id": 9
        },
        {
            "text": "years",
            "start": 46,
            "end": 51,
            "id": 10
        },
        {
            "text": ",",
            "start": 52,
            "end": 53,
            "id": 11
        },
        {
            "text": "where",
            "start": 54,
            "end": 59,
            "id": 12
        },
        {
            "text": "mexican",
            "start": 60,
            "end": 67,
            "id": 13
        },
        {
            "text": "food",
            "start": 68,
            "end": 72,
            "id": 14
        },
        {
            "text": "is",
            "start": 73,
            "end": 75,
            "id": 15
        },
        {
            "text": "non-existent",
            "start": 76,
            "end": 88,
            "id": 16
        },
        {
            "text": ",",
            "start": 89,
            "end": 90,
            "id": 17
        },
        {
            "text": "so",
            "start": 91,
            "end": 93,
            "id": 18
        },
        {
            "text": "we",
            "start": 94,
            "end": 96,
            "id": 19
        },
        {
            "text": "were",
            "start": 97,
            "end": 101,
            "id": 20
        },
        {
            "text": "looking",
            "start": 102,
            "end": 109,
            "id": 21
        },
        {
            "text": "forward",
            "start": 110,
            "end": 117,
            "id": 22
        },
        {
            "text": "to",
            "start": 118,
            "end": 120,
            "id": 23
        },
        {
            "text": "eating",
            "start": 121,
            "end": 127,
            "id": 24
        },
        {
            "text": "here",
            "start": 128,
            "end": 132,
            "id": 25
        },
        {
            "text": ".",
            "start": 133,
            "end": 134,
            "id": 26
        }
    ],
    "spans": [
        {
            "start": 76,
            "end": 88,
            "token_start": 16,
            "token_end": 16,
            "label": "food"
        }
    ]
}

ines · August 16, 2019, 8:40am

Hi! Thanks for the report and sorry about that – I do think there's indeed some weirdness going on at the moment with ner.manual not (always?) accepting pre-defined tokenization, even if tokens and token indices on the spans are provided.
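To rule out a problem in the data itself, it can help to confirm that the provided tokens and span indices agree with the text. Here's a minimal sketch of such a check in plain Python. It isn't part of Prodigy; the function name `find_token_mismatches` and the shortened example task are made up for illustration, and it only assumes the task format shown in the post above.

```python
def find_token_mismatches(task):
    """Return a list of messages where tokens or spans disagree with task["text"].

    Hypothetical helper, not a Prodigy API. Assumes the task layout from the
    post: "text", "tokens" (text/start/end/id) and "spans" (start/end/
    token_start/token_end).
    """
    text = task["text"]
    tokens = task.get("tokens", [])
    problems = []

    for i, tok in enumerate(tokens):
        # Token ids are expected to be consecutive, starting at 0.
        if tok["id"] != i:
            problems.append(f"token at position {i} has id {tok['id']}")
        # The character offsets should slice out exactly the token text.
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append(f"token {i} offsets do not match text {tok['text']!r}")

    for span in task.get("spans", []):
        ts, te = span["token_start"], span["token_end"]
        if not (0 <= ts <= te < len(tokens)):
            problems.append(f"span {span['label']!r} has token indices out of range")
            continue
        # The span's character offsets should line up with its first/last token.
        if tokens[ts]["start"] != span["start"] or tokens[te]["end"] != span["end"]:
            problems.append(f"span {span['label']!r} is not aligned to token boundaries")

    return problems


# Shortened version of the example task, with offsets recomputed for this text.
example = {
    "text": "mexican food is non-existent ,",
    "tokens": [
        {"text": "mexican", "start": 0, "end": 7, "id": 0},
        {"text": "food", "start": 8, "end": 12, "id": 1},
        {"text": "is", "start": 13, "end": 15, "id": 2},
        {"text": "non-existent", "start": 16, "end": 28, "id": 3},
        {"text": ",", "start": 29, "end": 30, "id": 4},
    ],
    "spans": [
        {"start": 16, "end": 28, "token_start": 3, "token_end": 3, "label": "food"}
    ],
}

print(find_token_mismatches(example))
```

If this comes back empty for your tasks, the input is internally consistent and the wrong highlighting is more likely caused by re-tokenization on the recipe side.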

In the meantime, the easiest workaround would be to just comment out the add_tokens line in the recipe so the stream isn't re-tokenized, or add a flag that lets you toggle it (e.g. via a command-line setting).
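As a rough illustration of the toggle idea, here's a sketch in plain Python. It does not use Prodigy's own add_tokens or recipe code: `add_tokens_if_missing`, `whitespace_tokenize` and the `force` flag are all hypothetical stand-ins, just to show the logic of leaving pre-tokenized tasks alone unless re-tokenization is explicitly requested.

```python
def whitespace_tokenize(text):
    """Stand-in tokenizer (an assumption): split on whitespace, keep offsets."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split()):
        start = text.index(word, pos)  # search from the end of the previous token
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        pos = end
    return tokens


def add_tokens_if_missing(stream, tokenize, force=False):
    """Yield tasks, adding "tokens" only where needed.

    Hypothetical wrapper: tasks that already carry "tokens" pass through
    untouched unless force=True, which plays the role of the toggle flag.
    """
    for task in stream:
        if force or "tokens" not in task:
            task = dict(task)  # shallow copy so the input task is not mutated
            task["tokens"] = tokenize(task["text"])
        yield task


stream = [
    # Pre-tokenized task: "non-existent" is deliberately kept as one token.
    {"text": "is non-existent",
     "tokens": [{"text": "is", "start": 0, "end": 2, "id": 0},
                {"text": "non-existent", "start": 3, "end": 15, "id": 1}]},
    # Task without tokens: gets tokenized by the stand-in tokenizer.
    {"text": "mexican food"},
]

for task in add_tokens_if_missing(stream, whitespace_tokenize):
    print([t["text"] for t in task["tokens"]])
```

In an actual recipe, the same condition would guard the existing add_tokens call, with the flag exposed as a recipe argument.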
