Spans not lining up with text tokens, autopopulate NER

aanifh · February 26, 2021, 8:33pm

Hi,
I have created an auto-populated file with "text", "tokens" and "spans" fields. But when I upload it to Prodigy, the spans are misaligned with the text from the 2nd sentence onward. However, the "id", "start", "end" etc. numbers seem to line up to me.
Do you have any advice on how to fix this? (I know the "ws" field is off, but even if I don't include it, it doesn't seem to matter.) Thank you

{"text": "I'm standing here with you, why won't you move?", "tokens": [{"text": "I'm", "start": 38, "end": 40, "id": 13, "ws": "true"}, {"text": "standing", "start": 41, "end": 48, "id": 14, "ws": "true"}, {"text": "here", "start": 49, "end": 52, "id": 15, "ws": "true"}, {"text": "with", "start": 53, "end": 56, "id": 16, "ws": "true"}, {"text": "you", "start": 57, "end": 59, "id": 17, "ws": "true"}, {"text": ",", "start": 60, "end": 60, "id": 18, "ws": "true"}, {"text": "why", "start": 61, "end": 63, "id": 19, "ws": "true"}, {"text": "won't", "start": 64, "end": 68, "id": 20, "ws": "true"}, {"text": "you", "start": 69, "end": 71, "id": 21, "ws": "true"}, {"text": "move", "start": 72, "end": 75, "id": 22, "ws": "true"}, {"text": "?", "start": 76, "end": 76, "id": 23, "ws": "true"}], "spans": [{"text": "you", "start": 57, "end": 59, "token_start": 17, "token_end": 17, "label": "PRON"}, {"text": "you", "start": 69, "end": 71, "token_start": 21, "token_end": 21, "label": "PRON"}]}

And this happened when I tried to annotate one of the spans:

ines · February 27, 2021, 12:10am

Hi! I was just looking at the example you posted and it seems like the start/end offsets and IDs are off? For example, this is the first token:

{"text": "I'm", "start": 38, "end": 40, "id": 13, "ws": "true"}

But it defines character 38 as its start offset into the "text" (which should probably be 0), and 13 as its ID. Did you maybe generate them from sentences? If so, you can adjust the standalone examples by subtracting the sentence start and token offsets – so if your sentence starts at character 38, the first token's offset would be token.start - 38 = 0, and so on.
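The subtraction described above can be sketched roughly as follows. This is only an illustration, not Prodigy's own code: the helper names `rebase_example` and `is_aligned` and the sample data are hypothetical. It assumes the document-level offsets count every character (including spaces), that "end" is exclusive in the Python-slice sense, and that each example's "text" begins exactly at its first token.

```python
import copy


def rebase_example(example):
    """Return a copy whose token/span offsets and IDs are relative to example["text"].

    Assumption: example["text"] starts exactly at the first token, so that
    token's document-level "start" and "id" are the amounts to subtract.
    """
    rebased = copy.deepcopy(example)
    tokens = rebased.get("tokens", [])
    if not tokens:
        return rebased
    char_shift = tokens[0]["start"]  # where the sentence starts in the document
    id_shift = tokens[0]["id"]       # ID of the sentence's first token
    for token in tokens:
        token["start"] -= char_shift
        token["end"] -= char_shift
        token["id"] -= id_shift
    for span in rebased.get("spans", []):
        span["start"] -= char_shift
        span["end"] -= char_shift
        span["token_start"] -= id_shift
        span["token_end"] -= id_shift
    return rebased


def is_aligned(example):
    """Check that slicing the text with each token's offsets gives the token text."""
    text = example["text"]
    return all(text[t["start"]:t["end"]] == t["text"] for t in example["tokens"])


# Hypothetical second sentence of the document "Move now. Why won't you move?",
# still carrying document-level offsets (it starts at character 10, token 3).
example = {
    "text": "Why won't you move?",
    "tokens": [
        {"text": "Why", "start": 10, "end": 13, "id": 3, "ws": True},
        {"text": "won't", "start": 14, "end": 19, "id": 4, "ws": True},
        {"text": "you", "start": 20, "end": 23, "id": 5, "ws": True},
        {"text": "move", "start": 24, "end": 28, "id": 6, "ws": False},
        {"text": "?", "start": 28, "end": 29, "id": 7, "ws": False},
    ],
    "spans": [
        {"start": 20, "end": 23, "token_start": 5, "token_end": 5, "label": "PRON"}
    ],
}

rebased = rebase_example(example)
```

Running `is_aligned` on each example before loading the file is a cheap way to catch offsets that are still counted across the whole document. If your offsets follow a different convention (for instance inclusive ends, or counts that skip whitespace), the subtraction alone will not be enough and the check will keep failing.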

aanifh · March 3, 2021, 1:17pm

Ahhh I understand now, I was keeping a running tally across the whole document, but it's per "line". I've adjusted my code and it's all working really well now, thank you very much!
