Tokenization causes glitched text

I am running ner.manual on Prodigy version 1.11. The command I'm using is this:

PRODIGY_LOGGING=basic prodigy ner.manual blank:en --label

I generate the file with the tokens and spans myself, and we've been verifying that the tokenization and spans match and are correct, but Prodigy still gives us a garbled view of the text. Any ideas?


candidates_comentions_10.29_test.jsonl (17.3 KB)

Hi @aelkholy

Thanks for waiting! It seems that there are some inconsistencies in the JSONL file, especially in how the token indices and character offsets are defined. Ideally, in your JSONL:

  • The list of tokens should contain the character offsets.
  • The list of spans should contain the character offsets in start and end, and the token indices in token_start and token_end. The token indices are the ids in the token list.
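To make the two points above concrete, here's a minimal sketch of how one consistent task could be built. It assumes plain whitespace tokenization, and the helper name `build_task`, the sample sentence and the `ORG` label are all made up for illustration, so adapt it to however you actually generate your file:

```python
import json


def build_task(text, entity_text, label):
    """Build one task dict with whitespace tokens and a single span.

    Hypothetical helper for illustration only, not part of Prodigy.
    """
    tokens = []
    offset = 0
    for i, word in enumerate(text.split()):
        # Character offsets of this token in the original text
        start = text.index(word, offset)
        end = start + len(word)
        # The id is simply the token's position in the list, so it is unique
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end

    # Character offsets of the entity (first occurrence only)
    span_start = text.index(entity_text)
    span_end = span_start + len(entity_text)
    # Ids of the tokens that fall entirely inside the entity
    covered = [
        t["id"] for t in tokens
        if t["start"] >= span_start and t["end"] <= span_end
    ]
    span = {
        "start": span_start,
        "end": span_end,
        "token_start": covered[0],
        "token_end": covered[-1],  # assumed inclusive: id of the last token
        "label": label,
    }
    return {"text": text, "tokens": tokens, "spans": [span]}


task = build_task("She joined JP Morgan last year", "JP Morgan", "ORG")
print(json.dumps(task))  # one line of the JSONL file
```

The key property is that a span's start equals the start of the token at token_start, and its end equals the end of the token at token_end.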

Here's where I found some inconsistencies. In the first example, we have the following token:

# excerpt
    {
      "text": "JP Morgan Chase & Co.",
      "start": 251,
      "end": 272,
      "id": 36
    },

However, when I checked the spans, I saw something like this:

# excerpt
    {
      "end": 740,
      "label": "NP",
      "start": 737,
      "token_end": 36,
      "token_start": 36
    },

This span seems to refer to the JP Morgan token ID (i.e. 36), but with the wrong character offsets (737 and 740). The label also seems incorrect.

You might also notice that the ids repeat in the token list. If I search for "id": 36, both the JP Morgan Chase & Co. token and the IIM token show up:

    # excerpt (the IDs should be unique)
    {
      "text": "JP Morgan Chase & Co.",
      "start": 251,
      "end": 272,
      "id": 36
    },
    ... 
    {
      "text": "IIM",
      "start": 737,
      "end": 740,
      "id": 36
    },

My recommendation is to clean up the IDs first and then resolve the remaining inconsistencies. The IDs should be unique, and the character offsets and token indices in the spans should match the tokens they refer to, so that everything renders properly.
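If it helps, here's a rough sketch of a check you could run over each task before loading the file. The function name `check_task` and the sample task are made up (the sample just reproduces the duplicate-ID pattern on a shorter text), and it assumes token_end is the ID of the last token in the span:

```python
def check_task(task):
    """Return a list of human-readable problems found in one task dict.

    Hypothetical helper for illustration only, not part of Prodigy.
    """
    problems = []
    text = task["text"]
    by_id = {}

    for tok in task.get("tokens", []):
        # Token ids must be unique within a task
        if tok["id"] in by_id:
            problems.append(f"duplicate token id {tok['id']} ({tok['text']!r})")
        else:
            by_id[tok["id"]] = tok
        # Token offsets must slice the token's own text out of the full text
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append(f"token {tok['id']} offsets do not match its text")

    for span in task.get("spans", []):
        first = by_id.get(span["token_start"])
        last = by_id.get(span["token_end"])
        if first is None or last is None:
            problems.append(f"span {span['start']}-{span['end']} uses an unknown token id")
            continue
        # Span offsets must line up with its first and last token
        if span["start"] != first["start"] or span["end"] != last["end"]:
            problems.append(
                f"span {span['start']}-{span['end']} does not match tokens "
                f"{span['token_start']}-{span['token_end']}"
            )
    return problems


# Made-up task that repeats id 0, similar to the repeated id 36 above
bad_task = {
    "text": "JP Morgan hired IIM grads",
    "tokens": [
        {"text": "JP Morgan", "start": 0, "end": 9, "id": 0},
        {"text": "hired", "start": 10, "end": 15, "id": 1},
        {"text": "IIM", "start": 16, "end": 19, "id": 0},
        {"text": "grads", "start": 20, "end": 25, "id": 3},
    ],
    "spans": [
        {"start": 16, "end": 19, "token_start": 0, "token_end": 0, "label": "NP"}
    ],
}

problems = check_task(bad_task)
for problem in problems:
    print(problem)
```

Here the check reports both the repeated ID and the span whose offsets point at a different token than its token_start and token_end. You could run the same function over every line of your JSONL (parsed with json.loads) and fix anything it flags.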