Spans not lining up with text tokens, autopopulate NER

Hi,
I have created an auto-populated file with "text", "tokens" and "spans" fields. But when I upload it to Prodigy, the spans are misaligned with the text from the 2nd sentence onward, even though the "id", "start", "end" etc. numbers seem to line up to me.
Do you have any advice on how to fix this? (I know the "ws" field is off, but even if I don't include it it doesn't seem to matter) Thank you :smiley:

{"text": "I'm standing here with you, why won't you move?", "tokens": [{"text": "I'm", "start": 38, "end": 40, "id": 13, "ws": "true"}, {"text": "standing", "start": 41, "end": 48, "id": 14, "ws": "true"}, {"text": "here", "start": 49, "end": 52, "id": 15, "ws": "true"}, {"text": "with", "start": 53, "end": 56, "id": 16, "ws": "true"}, {"text": "you", "start": 57, "end": 59, "id": 17, "ws": "true"}, {"text": ",", "start": 60, "end": 60, "id": 18, "ws": "true"}, {"text": "why", "start": 61, "end": 63, "id": 19, "ws": "true"}, {"text": "won't", "start": 64, "end": 68, "id": 20, "ws": "true"}, {"text": "you", "start": 69, "end": 71, "id": 21, "ws": "true"}, {"text": "move", "start": 72, "end": 75, "id": 22, "ws": "true"}, {"text": "?", "start": 76, "end": 76, "id": 23, "ws": "true"}], "spans": [{"text": "you", "start": 57, "end": 59, "token_start": 17, "token_end": 17, "label": "PRON"}, {"text": "you", "start": 69, "end": 71, "token_start": 21, "token_end": 21, "label": "PRON"}]}

And this happened when I tried to annotate one of the spans:

Hi! I was just looking at the example you posted and it seems like the start/end offsets and IDs are off? For example, this is the first token:

{"text": "I'm", "start": 38, "end": 40, "id": 13, "ws": "true"}

But it defines character 38 as its start offset into the "text" (which should probably be 0), and 13 as its ID (which should probably also be 0). Did you maybe generate them from sentences of a longer document? If so, you can adjust the standalone examples by subtracting the sentence's start offset from the character offsets and the first token's ID from the token IDs – so if your sentence starts at character 38, the first token's offset would be token.start - 38 = 0, and so on.
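
In case it helps, here is a rough Python sketch of that subtraction. `rebase_example` is just a hypothetical helper name (it is not part of Prodigy), and it assumes each example's "text" begins exactly at its first token, so that token's "start" and "id" can serve as the amounts to subtract. It only shifts the numbers and doesn't check them against the text.

```python
import copy


def rebase_example(example):
    """Return a copy of the example with offsets and IDs relative to its own text.

    Hypothetical helper: assumes the first token marks the start of the text.
    """
    out = copy.deepcopy(example)
    tokens = out.get("tokens", [])
    if not tokens:
        return out
    char_shift = tokens[0]["start"]  # document-level character offset of the sentence
    id_shift = tokens[0]["id"]       # document-level ID of the sentence's first token
    for tok in tokens:
        tok["start"] -= char_shift
        tok["end"] -= char_shift
        tok["id"] -= id_shift
    for span in out.get("spans", []):
        span["start"] -= char_shift
        span["end"] -= char_shift
        span["token_start"] -= id_shift
        span["token_end"] -= id_shift
    return out


# Tiny cut-down example using numbers from the post above
example = {
    "text": "I'm standing here with you, why won't you move?",
    "tokens": [{"text": "I'm", "start": 38, "end": 40, "id": 13}],
    "spans": [{"text": "you", "start": 57, "end": 59,
               "token_start": 17, "token_end": 17, "label": "PRON"}],
}
fixed = rebase_example(example)
```

You'd run something like this over each line of the JSONL file before loading it, and it's still worth spot-checking a few examples to confirm that the shifted "start"/"end" values actually slice the expected token out of the "text".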


Ahhh I understand now, I was doing a total tally from across the document :see_no_evil: It's per-each "line", I've adjusted my code and it's all working really well now, thank you very much! :smiley:
