Hello, I have exported NER annotations with the traditional db-out approach.
I wrote a script to visually inspect the annotations:
    import json

    import spacy

    # lang is the language code of the annotated text
    nlp = spacy.blank(lang)
    with open("/annotations.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            annot = json.loads(line)
            doc = nlp(annot["text"])
            for elem in annot["spans"]:
                # Span sliced by token indices
                print(doc[elem["token_start"]:elem["token_end"]].text)
                # Span sliced by character offsets
                print(doc.text[elem["start"]:elem["end"]])
                print()
The first print statement shows each annotation according to its token boundaries, and the second one according to its character offsets.
I'm seeing results like this:
keen eye for
keen eye for detail
specimen reception and
specimen reception and disposal
Given that Python excludes the upper bound when slicing, could there be an inconsistency between the exported token indices and the character offsets?
This doesn't affect NER training in spaCy, given that the gold-to-spacy recipe works from the character offsets.
I just want to check whether this is the intended result, or whether I'm missing something.
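To make the upper-bound guess concrete, here is a minimal sketch of the comparison. It assumes that `token_end` is the index of the last token in the span (inclusive), which I have not confirmed. The record is made up to mirror the output above, and plain whitespace tokens stand in for the spaCy `Doc` so the snippet runs on its own.

```python
# Hypothetical record shaped like one span from the exported JSONL.
text = "A keen eye for detail is required"
tokens = text.split()  # whitespace tokens stand in for the spaCy Doc

# Assumption: token_end points at the last token of the span (inclusive),
# while the character offset "end" is exclusive.
span = {"start": 2, "end": 21, "token_start": 1, "token_end": 4}

# Slice by character offsets.
by_chars = text[span["start"]:span["end"]]

# Slice by token indices, treating token_end as exclusive (as in my script).
exclusive = " ".join(tokens[span["token_start"]:span["token_end"]])

# Slice by token indices, treating token_end as inclusive.
inclusive = " ".join(tokens[span["token_start"]:span["token_end"] + 1])

print(exclusive)
print(inclusive)
print(by_chars)
```

If the assumption holds, only the `+ 1` version matches the character-offset slice, which would explain the missing last token in my output.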
Best regards