Ah, okay, that makes sense then. Are you just using mark? I think I misread your initial question and thought you were using the built-in ner.manual recipe, which does take care of the tokenization automatically.
If you need your own custom tokens that align with your entity spans, then you also need to provide them. It might be worth writing a little script to check how many of the spans do not align – maybe it’s just one or two that you can easily correct manually (or exclude from your data).
An easy way to do this is to use spaCy’s Doc.char_span method, which creates a token span from character offsets. If the character offsets don’t align to the tokens, it returns None. So you can do something like this:
import spacy

nlp = spacy.load("en_core_web_sm")  # or other model
for example in examples:  # your existing examples
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
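If most spans do align, the same check could also be used to attach the tokens to each example and set aside the ones that need fixing. Here is a rough sketch of that idea. The helper name split_by_alignment and the sample data are made up for illustration, and the "tokens", "token_start" and "token_end" keys (with an inclusive "token_end") are my assumption about the task format you need, so double-check them against the Prodigy docs. It uses spacy.blank("en") only so that it runs without a downloaded model. You would want to use whichever tokenizer matches your data.

```python
import spacy


def split_by_alignment(examples, nlp):
    """Split examples into those whose spans all align to tokens and those that don't.

    Aligned examples get a "tokens" list plus token indices on each span
    (assumed format, see the note above).
    """
    aligned, misaligned = [], []
    for example in examples:
        doc = nlp(example["text"])
        tokens = [
            {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
            for t in doc
        ]
        spans = []
        all_ok = True
        for span in example.get("spans", []):
            char_span = doc.char_span(span["start"], span["end"])
            if char_span is None:  # offsets don't map to token boundaries
                all_ok = False
                break
            # char_span.end is exclusive, so subtract 1 for an inclusive index
            spans.append(
                {**span, "token_start": char_span.start, "token_end": char_span.end - 1}
            )
        if all_ok:
            aligned.append({**example, "tokens": tokens, "spans": spans})
        else:
            misaligned.append(example)
    return aligned, misaligned


nlp = spacy.blank("en")  # tokenizer only, no model download needed
examples = [  # hypothetical sample data
    {"text": "Apple opens a store in Berlin", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple opens a store in Berlin", "spans": [{"start": 0, "end": 3, "label": "ORG"}]},
]
aligned, misaligned = split_by_alignment(examples, nlp)
print(len(aligned), "aligned,", len(misaligned), "misaligned")
```

The misaligned list is then what you would correct by hand or exclude, and the aligned list already carries tokens that match its spans.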