Boundaries (token/offsets) on NER annotations

Hello, I have exported NER annotations with the traditional db-out approach.
I wrote a script to visually inspect the annotations:

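import json

import spacy

# lang is the language code of the annotated text, defined elsewhere
nlp = spacy.blank(lang)

with open("/annotations.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        annot = json.loads(line)
        doc = nlp(annot["text"])
        for elem in annot["spans"]:
            # Slice the Doc by the exported token indices
            print(doc[elem["token_start"] : elem["token_end"]].text)
            # Slice the raw text by the exported character offsets
            print(doc.text[elem["start"] : elem["end"]])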


The first print statement shows each annotation according to its token boundaries, and the second according to its character offsets.

I'm seeing results like these (token-based slice first, character-based slice second):

keen eye for
keen eye for detail

specimen reception and
specimen reception and disposal

Given that Python slicing excludes the upper bound, could there be an inconsistency between the token indices and the character offsets in the export?

This doesn't affect NER training in spaCy, though, given that the gold-to-spacy recipe works from character offsets.

I just wanted to check whether this is the intended result, and to let you know in case it isn't, or to find out whether I'm missing something.

Best regards

Yes, token_end currently has an off-by-one quirk, which we've avoided correcting for backwards-compatibility reasons. It refers to the last token inside the span, whereas it should really refer to the first token after the span, for consistency with Python's slicing semantics, as you point out.

Both values are handled correctly within Prodigy, and you should be able to just add one to token_end whenever you slice with it. You can also use the ner.print-dataset command, which I think should show you the information you're looking for.
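
To make the "add one" workaround concrete, here is a minimal sketch. It assumes a hypothetical record shaped like one line of the exported JSONL, and it uses plain whitespace splitting as a stand-in for the spaCy tokenizer so that it runs without any dependencies. The helper name span_tokens is made up for illustration.

```python
# Hypothetical record; whitespace splitting stands in for the spaCy tokenizer.
text = "She has a keen eye for detail"
tokens = text.split()

# token_end is inclusive here: it points at the last token inside the span.
span = {"start": 10, "end": 29, "token_start": 3, "token_end": 6}


def span_tokens(tokens, span):
    """Return the tokens of a span whose token_end is inclusive."""
    # Add one so the (exclusive) slice upper bound lands after the last token.
    return tokens[span["token_start"] : span["token_end"] + 1]


naive = " ".join(tokens[span["token_start"] : span["token_end"]])
fixed = " ".join(span_tokens(tokens, span))
by_chars = text[span["start"] : span["end"]]

print(naive)     # keen eye for
print(fixed)     # keen eye for detail
print(by_chars)  # keen eye for detail
```

With a spaCy Doc, the equivalent slice in the script above would be doc[elem["token_start"] : elem["token_end"] + 1].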