Token indices in NER jsonl format

Hello,
I am a bit confused about the format for NER annotations.
In the documentation I see:

{
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1}
    ],
    "spans": [
        {"start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1}
    ]
}

Why do token_start and token_end have the same value? I thought it was just a quirk of the example, but no: I have to use my_span.end - 1 as the token_end value for the spans to display correctly in the ner.manual interface.

Could anyone explain the reason? Basically, I pre-annotate my sentences and then use ner.manual to check them and add other labels.

Yes, in this format, token_start and token_end are both inclusive: they are the indices of the tokens the entity span starts and ends at. So the span here describes token 1, “Apple”. In hindsight, it’s slightly inconsistent with how spaCy annotates token indices (there, the end index is exclusive), and if I had to do it again, I’d probably make the token end index exclusive. But that’d be a backwards-incompatible change.
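
To make the inclusive indexing concrete, here is a minimal sketch in plain Python that derives token_start and token_end from character offsets. The helper name to_prodigy_span is made up for this illustration and is not part of any library; it assumes the span boundaries line up exactly with token boundaries.

```python
def to_prodigy_span(tokens, start_char, end_char, label):
    """Build a span dict with inclusive token indices (hypothetical helper)."""
    # Token whose first character is the first character of the span.
    token_start = next(t["id"] for t in tokens if t["start"] == start_char)
    # Token whose end offset matches the span's end offset.
    # This is the index of the LAST token in the span, not one past it.
    token_end = next(t["id"] for t in tokens if t["end"] == end_char)
    return {
        "start": start_char,
        "end": end_char,
        "label": label,
        "token_start": token_start,
        "token_end": token_end,
    }


tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0},
    {"text": "Apple", "start": 6, "end": 11, "id": 1},
]

# A single-token entity starts and ends on the same token.
span = to_prodigy_span(tokens, 6, 11, "ORG")
print(span)
```

If a span does not align with the tokens, next() raises StopIteration here, so real data would need an explicit check. When starting from a spaCy Span instead of character offsets, the equivalent is what you already found: token_start is my_span.start and token_end is my_span.end - 1.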

Btw, if your pre-annotated spans are consistent with the tokenization and you’re using ner.manual, you can also leave out the "tokens", and the recipe will take care of those automatically.
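
As a rough sketch of that shortcut, the snippet below writes a task as one JSONL line without "tokens". The helper make_task is hypothetical, and the snippet assumes that character offsets plus a label are enough for the recipe to fill in the token information, provided the offsets match the tokenizer's boundaries.

```python
import json


def make_task(text, spans):
    """Build a task with character-offset spans only (hypothetical helper)."""
    return {
        "text": text,
        # Each span is (start_char, end_char, label); no token indices given.
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in spans
        ],
    }


task = make_task("Hello Apple", [(6, 11, "ORG")])
line = json.dumps(task)  # one line of the JSONL file
print(line)
```

The character end offset is exclusive, so task["text"][6:11] gives back "Apple".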