Boundaries (token/offsets) on NER annotations

AlejandroJCR · October 14, 2019, 1:29pm

Hello, I have exported NER annotations with the traditional db-out approach.
I wrote a script to visually inspect the annotations:

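import json
import spacy

lang = "en"  # assumed; set to the language code of your data
nlp = spacy.blank(lang)

with open("/annotations.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        annot = json.loads(line)
        doc = nlp(annot["text"])
        for elem in annot["spans"]:
            # Span text according to the token boundaries
            print(doc[elem["token_start"] : elem["token_end"]].text)
            # Span text according to the character offsets
            print(doc.text[elem["start"] : elem["end"]])
        print()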

The first print statement shows each annotation according to its token boundaries, and the second one according to its character offsets.

I'm seeing results like these:

keen eye for
keen eye for detail

specimen reception and
specimen reception and disposal

Given that Python excludes the upper bound when slicing lists, could there be an inconsistency between the token indices and the character offsets in the export?

This doesn't influence NER training in spaCy, though, given that the gold-to-spacy recipe works from character offsets.

I just want to check whether this is the intended result, to let you know about it, or to find out if I'm missing something.

Best regards

honnibal · October 16, 2019, 10:40am

Yes, the token_end has an off-by-one quirk currently, which we've avoided correcting for backwards compatibility reasons. It currently refers to the last element, where it should really be the first element outside of the entry, for consistency with Python list slicing semantics, as you point out.

The two are handled correctly within Prodigy, and you should be able to just add one when you're working with the token_end. You can also use the ner.print-dataset command, which I think should show you the information you're looking for.
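
As a rough illustration of the "add one" workaround, here is a minimal sketch. It assumes token_end is the index of the last token inside the span, as described above. The helper name token_slice, the sample sentence, the label, and the whitespace tokenization are all made up for the example and are not part of Prodigy or spaCy.

```python
def token_slice(span):
    """Return a slice covering the span's tokens.

    Assumes span["token_end"] is the index of the last token in the span
    (inclusive), so one is added to get Python's exclusive upper bound.
    """
    return slice(span["token_start"], span["token_end"] + 1)


# Hypothetical record: tokens are split on whitespace for illustration only.
text = "a keen eye for detail is required"
tokens = text.split()
span = {"start": 2, "end": 21, "token_start": 1, "token_end": 4, "label": "SKILL"}

# Both views of the span should now agree.
from_tokens = " ".join(tokens[token_slice(span)])
from_chars = text[span["start"]:span["end"]]
print(from_tokens)
print(from_chars)
```

In the inspection script from the question, the equivalent change would be slicing the doc with elem["token_end"] + 1 as the upper bound.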
