Why do ner_manual spans require start/end?


I've been running NER annotations on pre-tokenized documents in Prodigy for some time now, and with Prodigy 1.10.7 I find that NER spans now require start/end, even though they already have token_start and token_end. Is this a bug?

My spans[0] is
{'token_start': 67, 'token_end': 67, 'channel_id': 0, 'label': 'PERSON', 'type': 'entity'}

I'm getting:
✘ Invalid task format for view ID 'ner_manual'
spans -> 0 -> start   field required
spans -> 0 -> end   field required

Hi! I think this is mostly a case of the validation being stricter and more explicit now than it was in the past. The requirement is there for consistency: the start/end character offsets are the minimum viable span information, and based on them, we can align the span to the tokens. It's less common for people to have the full tokenization available, so that's usually the gap Prodigy needs to fill in.

That said, you're right that in theory, we can read the start and end offsets from the "tokens", assuming they provide the correct character offsets. I'll see if we can adjust the validation to make it conditional and require either tokens + token_start/token_end on the spans, or start/end on the spans.
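
Until the validation is relaxed, one possible workaround is to fill in the character offsets yourself before the tasks reach ner_manual. The sketch below is only an illustration, not Prodigy's own implementation: add_char_offsets is a hypothetical helper, and it assumes that every entry in "tokens" carries "start" and "end" character offsets, that token_start/token_end are indexes into that list, and that token_end is inclusive.

```python
def add_char_offsets(task):
    """Fill in span "start"/"end" from the task's pre-tokenized "tokens".

    Hypothetical helper. Assumes each token dict has "start"/"end"
    character offsets and that span["token_end"] is an inclusive index.
    """
    tokens = task["tokens"]
    for span in task.get("spans", []):
        if "start" in span and "end" in span:
            continue  # character offsets already present, nothing to do
        first = tokens[span["token_start"]]
        last = tokens[span["token_end"]]
        span["start"] = first["start"]  # first character of the first token
        span["end"] = last["end"]       # end offset of the last token
    return task


# Made-up example task, only to show the shape of the data
example = {
    "text": "Hello Ada Lovelace",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Ada", "start": 6, "end": 9, "id": 1},
        {"text": "Lovelace", "start": 10, "end": 18, "id": 2},
    ],
    "spans": [{"token_start": 1, "token_end": 2, "label": "PERSON"}],
}
fixed = add_char_offsets(example)
span = fixed["spans"][0]
print(example["text"][span["start"]:span["end"]])
```

You could run each task through a function like this when loading your data, so the spans carry both the token indexes and the character offsets by the time they are validated.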