Token indices in NER jsonl format

damiano · May 18, 2019, 1:42pm

Hello,
I am a bit confused about the format for NER annotations.
In the documentation I see:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}

Why do token_start and token_end have the same value? I thought it was just a quirk of the example, but no: to see the spans correctly in the ner.manual interface, I have to set token_end to my_span.end - 1.

Could anyone explain the reason? Basically, I pre-annotate my sentences and then use ner.manual to check them and add other labels.

ines · May 20, 2019, 9:35am

Yes, in the format here, token_start and token_end are the indices of the tokens the entity span starts and ends at, so both are inclusive. The span here describes token 1, “Apple”. In hindsight, it’s slightly inconsistent with how spaCy annotates token indices (where the end index is exclusive), and if I had to do it again, I’d probably make the token end index exclusive. But that’d be a backwards-incompatible change.
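To make the inclusive convention concrete, here is a rough sketch in plain Python of filling in token_start and token_end from character offsets. The helper names add_token_indices and from_exclusive are made up for illustration and are not part of Prodigy or spaCy; the sketch also assumes each token's "id" is its position in the token list and that the span lines up exactly with token boundaries.

```python
def add_token_indices(span, tokens):
    """Return a copy of `span` with inclusive token_start/token_end added.

    Hypothetical helper: assumes the span's character offsets coincide
    with token boundaries.
    """
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    if span["start"] not in starts or span["end"] not in ends:
        raise ValueError("span does not align with token boundaries")
    return {
        **span,
        "token_start": starts[span["start"]],
        "token_end": ends[span["end"]],  # inclusive: id of the last token
    }


def from_exclusive(token_start, token_end_exclusive):
    """Convert an exclusive end index (spaCy-style) to the inclusive one."""
    return token_start, token_end_exclusive - 1


tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "Apple", "start": 6, "end": 11, "id": 1},
]
span = add_token_indices({"start": 6, "end": 11, "label": "ORG"}, tokens)
print(span)
```

For a single-token entity, both indices come out as the same token id, which is why the example in the docs has token_start and token_end both set to 1.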

Btw, if your pre-annotated spans are consistent with the tokenization and you’re using ner.manual, you can also leave out the "tokens", and the recipe will take care of those automatically.
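As a minimal sketch of that second option, a pre-annotated task without "tokens" could be written out like this. The make_task helper is hypothetical, and the sketch assumes that character offsets plus a label are enough for each span when the tokens are left out.

```python
import json


def make_task(text, spans):
    """Build one JSONL task with character-offset spans and no "tokens".

    Hypothetical helper: `spans` is a list of (start, end, label) tuples.
    """
    return {
        "text": text,
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in spans
        ],
    }


# One JSON object per line is what a .jsonl file expects.
line = json.dumps(make_task("Hello Apple", [(6, 11, "ORG")]))
print(line)
```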
