Why do ner_manual spans require start/end?

niedakh · September 9, 2021, 12:20pm

Hi,

I've been running NER annotation on pre-tokenized documents in Prodigy for some time now. With Prodigy 1.10.7, I find that NER spans now require start/end, even though they already have token_start and token_end. Is this a bug?

My spans[0] is
{'token_start': 67, 'token_end': 67, 'channel_id': 0, 'label': 'PERSON', 'type': 'entity'}

I'm getting:
✘ Invalid task format for view ID 'ner_manual'
spans -> 0 -> start   field required
spans -> 0 -> end   field required

ines · September 13, 2021, 2:36am

Hi! I think this is mostly a case of the validation being more explicit now (it was less strict in the past). The requirement is mostly there for consistency: the start/end character offsets are the minimum viable span information, and we can align the tokens based on them. It's less common for people to have the full tokenization available, so that's usually the gap Prodigy needs to fill in.

That said, you're right that in theory, we can read the start and end offsets from the "tokens", assuming they provide the correct character offsets. I'll see if we can adjust the validation to make it conditional and require either tokens + token_start/token_end on the spans, or start/end on the spans.
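
In the meantime, a workaround along these lines might help: fill in the character offsets from the tokens before loading the tasks. This is only a minimal sketch, not Prodigy's own logic. It assumes every entry in "tokens" carries "start" and "end" character offsets and that token_end is inclusive (as in the span above, where token_start and token_end are both 67). The helper name add_char_offsets and the example task are made up for illustration.

```python
def add_char_offsets(task):
    """Return a copy of the task whose spans carry start/end character offsets.

    Assumption: each token dict has "start" and "end" character offsets,
    and span["token_end"] is the index of the last token (inclusive).
    """
    tokens = task["tokens"]
    spans = []
    for span in task.get("spans", []):
        span = dict(span)  # copy so the input task is not modified
        if "start" not in span or "end" not in span:
            first = tokens[span["token_start"]]
            last = tokens[span["token_end"]]
            span["start"] = first["start"]
            span["end"] = last["end"]
        spans.append(span)
    return {**task, "spans": spans}


# Hypothetical pre-tokenized task, for illustration only
example = {
    "text": "Hello Ada Lovelace",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Ada", "start": 6, "end": 9, "id": 1},
        {"text": "Lovelace", "start": 10, "end": 18, "id": 2},
    ],
    "spans": [{"token_start": 1, "token_end": 2, "label": "PERSON"}],
}

fixed = add_char_offsets(example)
```

Spans that already have start/end are left untouched, so the helper can be run over mixed data.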
