Hi @aelkholy
Thanks for waiting! It seems that there are some inconsistencies in the JSONL file, especially in how the token indices and character offsets are defined. Ideally, in your JSONL:
- The list of tokens should contain the character offsets.
- The list of spans should contain the character offsets in `start` and `end`, and the token indices in `token_start` and `token_end`. The token indices are the `id`s in the token list.
Here's where I found some inconsistencies. In the first example, we have the following token:
# excerpt
{
"text": "JP Morgan Chase & Co.",
"start": 251,
"end": 272,
"id": 36
},
However, when I checked the spans, I saw something like this:
# excerpt
{
"end": 740,
"label": "NP",
"start": 737,
"token_end": 36,
"token_start": 36
},
This span seems to refer to the JP Morgan token (i.e. `id` 36), but its character offsets (737 and 740) don't match that token's offsets (251 and 272). The label also seems incorrect.
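If it helps, here is a minimal sketch of how you could check a span against its tokens. The helper name `check_span` is made up for illustration, and it assumes that a span's `start` should equal the `start` of the token at `token_start`, and that its `end` should equal the `end` of the token at `token_end`:

```python
def check_span(span, tokens):
    """Return a list of problems for one span (hypothetical helper).

    Assumes span["start"] should match the first token's "start" and
    span["end"] should match the last token's "end".
    """
    # Look tokens up by their "id"; with duplicate ids the last one wins,
    # so duplicates should be detected separately.
    by_id = {tok["id"]: tok for tok in tokens}
    first = by_id.get(span["token_start"])
    last = by_id.get(span["token_end"])
    if first is None or last is None:
        return ["span points at a token id that does not exist"]

    problems = []
    if span["start"] != first["start"]:
        problems.append(f"start {span['start']} != token start {first['start']}")
    if span["end"] != last["end"]:
        problems.append(f"end {span['end']} != token end {last['end']}")
    return problems


# The token and span from the excerpts above.
tokens = [{"text": "JP Morgan Chase & Co.", "start": 251, "end": 272, "id": 36}]
span = {"start": 737, "end": 740, "label": "NP", "token_start": 36, "token_end": 36}

problems = check_span(span, tokens)
for problem in problems:
    print(problem)
```

With the excerpts above, both the `start` and the `end` of the span are reported as mismatched.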
You might also notice that the `id`s repeat in the token list. If I search for `"id": 36`, both the JP Morgan Chase & Co. and IIM tokens show up:
# excerpt (the IDs should be unique)
{
"text": "JP Morgan Chase & Co.",
"start": 251,
"end": 272,
"id": 36
},
...
{
"text": "IIM",
"start": 737,
"end": 740,
"id": 36
},
My recommendation is to clean up the IDs first and then resolve the inconsistencies. Ideally, the IDs should be unique, and the token offsets and spans should be correct so that they render properly.
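As a starting point for the cleanup, here is a rough sketch for spotting repeated token ids within one example. The function name `duplicate_token_ids` is hypothetical and not part of any library; it only reports duplicates and doesn't change anything:

```python
from collections import defaultdict


def duplicate_token_ids(tokens):
    """Map each repeated token id to the texts of the tokens that share it."""
    seen = defaultdict(list)
    for tok in tokens:
        seen[tok["id"]].append(tok["text"])
    # Keep only the ids that occur more than once.
    return {tid: texts for tid, texts in seen.items() if len(texts) > 1}


# The two tokens from the excerpt above.
tokens = [
    {"text": "JP Morgan Chase & Co.", "start": 251, "end": 272, "id": 36},
    {"text": "IIM", "start": 737, "end": 740, "id": 36},
]

dupes = duplicate_token_ids(tokens)
print(dupes)  # {36: ['JP Morgan Chase & Co.', 'IIM']}
```

To run this over the whole file, you could parse each line with `json.loads` and pass that example's token list to the function (assuming the list is stored under a `"tokens"` key). Note that if you reassign ids, the `token_start` and `token_end` values in the spans need to be updated to match.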