Tokenization causes glitched text

aelkholy · October 29, 2021, 6:49pm

I am running ner.manual on Prodigy 1.11. The command I'm using is this:

PRODIGY_LOGGING=basic prodigy ner.manual blank:en --label

I generate the file with the tokens and spans myself, and we've been verifying that the tokenization and spans match and are correct, but Prodigy still gives us a garbled view of the text. Any ideas?

candidates_comentions_10.29_test.jsonl (17.3 KB)

ljvmiranda921 · November 2, 2021, 11:48pm

Hi @aelkholy

Thanks for waiting! It seems that there are some inconsistencies in the JSONL file, especially in how the token indices and character offsets are defined. Ideally, in your JSONL:

The list of tokens should contain the character offsets.
The list of spans should contain the character offsets in start and end, and the token indices in token_start and token_end. The token indices are the ids in the token list.
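To make the two points above concrete, here is a minimal sketch in Python of a task where the token offsets, token ids, and span fields all agree. It assumes simple whitespace tokenization, and the `build_task` helper, the sample sentence, and the "ORG" label are made up for illustration (they are not from your file). It also assumes `token_end` is the id of the last token inside the span, which matches the single-token span in your data where `token_start` and `token_end` are equal.

```python
import json


def build_task(text, labeled):
    """Build one task dict with consistent tokens and spans.

    `labeled` maps (token_start, token_end) -> label, where both token
    indices are inclusive (an assumption of this sketch).
    """
    tokens = []
    cursor = 0
    for i, word in enumerate(text.split()):
        # Locate each whitespace-separated word in the original text so
        # the character offsets always point back into `text`.
        start = text.index(word, cursor)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        cursor = end

    spans = []
    for (tok_start, tok_end), label in labeled.items():
        spans.append({
            # Character offsets are copied from the tokens the span covers,
            # so they cannot drift away from the token indices.
            "start": tokens[tok_start]["start"],
            "end": tokens[tok_end]["end"],
            "token_start": tok_start,
            "token_end": tok_end,
            "label": label,
        })
    return {"text": text, "tokens": tokens, "spans": spans}


# Hypothetical example sentence and label, for illustration only.
task = build_task("She joined JP Morgan Chase & Co. in 2019", {(2, 6): "ORG"})
print(json.dumps(task))
```

Because the span's `start` and `end` are derived from the tokens it points to, the ids and offsets stay in sync by construction rather than being maintained by hand.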

Here's where I found some inconsistencies. In the first example, we have the following token:

# excerpt
    {
      "text": "JP Morgan Chase & Co.",
      "start": 251,
      "end": 272,
      "id": 36
    },

However, when I checked the spans, I saw something like this:

# excerpt
    {
      "end": 740,
      "label": "NP",
      "start": 737,
      "token_end": 36,
      "token_start": 36
    },

This span seems to refer to the JP Morgan token ID (i.e. 36), but its character offsets (737 and 740) don't match that token's offsets (251 and 272). The label also seems incorrect.

You might also notice that the ids repeat in the token list. If I search for "id": 36, both the JP Morgan Chase & Co. token and the IIM token show up:

# excerpt (the IDs should be unique)
    {
      "text": "JP Morgan Chase & Co.",
      "start": 251,
      "end": 272,
      "id": 36
    },
    ... 
    {
      "text": "IIM",
      "start": 737,
      "end": 740,
      "id": 36
    },

My recommendation is to clean up the IDs first and then resolve the remaining inconsistencies. Ideally, the IDs should be unique, and the token offsets and spans should agree with each other so that they render properly.
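If it helps, the checks described above could be automated with something like the following sketch. It is plain Python (not part of Prodigy), the `check_task` and `check_file` names are hypothetical, and it assumes each line of the JSONL has "text", "tokens", and "spans" keys shaped like the excerpts above.

```python
import json


def check_task(task):
    """Return a list of human-readable problems found in one task dict."""
    problems = []
    text = task.get("text", "")
    by_id = {}

    for tok in task.get("tokens", []):
        # Token ids must be unique, otherwise a span cannot refer to
        # exactly one token.
        if tok["id"] in by_id:
            first = by_id[tok["id"]]
            problems.append(
                f"duplicate token id {tok['id']}: "
                f"{first['text']!r} and {tok['text']!r}"
            )
        else:
            by_id[tok["id"]] = tok
        # The token's character offsets must slice its own text.
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append(
                f"token {tok['id']} offsets {tok['start']}-{tok['end']} "
                f"do not match its text {tok['text']!r}"
            )

    for span in task.get("spans", []):
        first = by_id.get(span["token_start"])
        last = by_id.get(span["token_end"])
        if first is None or last is None:
            problems.append(f"span {span} refers to a missing token id")
            continue
        # Span character offsets must agree with the tokens it points to.
        if span["start"] != first["start"] or span["end"] != last["end"]:
            problems.append(
                f"span {span['start']}-{span['end']} disagrees with "
                f"tokens {span['token_start']}-{span['token_end']} "
                f"({first['start']}-{last['end']})"
            )
    return problems


def check_file(path):
    """Print every problem found in a JSONL file, one task per line."""
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            for problem in check_task(json.loads(line)):
                print(f"line {line_no}: {problem}")
```

Calling `check_file` on the uploaded file should surface the duplicate "id": 36 and the span whose offsets (737 and 740) disagree with the first token carrying that id. Once it prints nothing, the ids and offsets are at least internally consistent.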
