Disappearing spans when using data-to-spacy

As an aside, for extra context: all base English models in spaCy use the same tokeniser under the hood. So the tokens from nlp = spacy.blank("en") should be the same as those from spacy.load("en_core_web_sm"). These tokens are all determined by the same rule-based system.

But yeah, it does sound like there's a mismatch between the character offsets of your spans and the token boundaries. One avenue to explore is to re-tokenize the text and rebuild the spans with spaCy's Doc.char_span method, which comes with an alignment_mode parameter that should allow you to wiggle around minor character issues. Beware that this is an automated method, so it may also highlight text that wasn't part of the original annotation. But it could help with your current issue, if only as a temporary measure.

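import spacy
import srsly

nlp = spacy.blank("en")
# Take the first example from the exported file
ex = next(srsly.read_jsonl("debug.jsonl"))
doc = nlp(ex["text"])
# "expand" snaps the character offsets outward to the nearest token boundaries
span = doc.char_span(228, 230, label="NAME", alignment_mode="expand")
print(span)
# Prints `Dr.`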
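To make the difference between the alignment modes a bit more concrete, here is a small self-contained sketch. The sentence and the offsets are made up for illustration and are not taken from your data; the only assumption is that the annotation covers "Dr. Smi" and therefore ends in the middle of a token.

```python
import spacy

nlp = spacy.blank("en")
text = "I met Dr. Smith yesterday."  # made-up example sentence
doc = nlp(text)

# Hypothetical annotation that stops halfway through "Smith"
start = text.index("Dr. Smi")
end = start + len("Dr. Smi")

# "strict" (the default) returns None unless both offsets sit on token boundaries
strict = doc.char_span(start, end, label="NAME", alignment_mode="strict")
# "contract" keeps only the tokens that fall completely inside the offsets
contract = doc.char_span(start, end, label="NAME", alignment_mode="contract")
# "expand" grows the span to cover every token that is at least partly inside
expand = doc.char_span(start, end, label="NAME", alignment_mode="expand")

print(strict)
print(contract)
print(expand)
```

With "strict" you get None back, which is one way a span can silently go missing. "contract" and "expand" both give you a span, but neither matches the original characters exactly, so it's worth eyeballing the results.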

Have you tried something like that?
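If it helps to see which annotations are affected before changing anything, a rough sketch like the one below could list them. It assumes each example is a dict with a "text" key and a "spans" list holding "start", "end" and "label", and find_misaligned is just a helper name I made up for this post. The inline example is invented; you'd swap in srsly.read_jsonl on your own file.

```python
import spacy

nlp = spacy.blank("en")


def find_misaligned(examples):
    """Yield (text, span) pairs whose offsets don't match token boundaries."""
    for eg in examples:
        doc = nlp(eg["text"])
        for span in eg.get("spans", []):
            # The default alignment mode is "strict", so None means a mismatch
            if doc.char_span(span["start"], span["end"]) is None:
                yield eg["text"], span


# Invented example: the first span cuts "Dr." short, the second one is fine
examples = [
    {
        "text": "I met Dr. Smith yesterday.",
        "spans": [
            {"start": 6, "end": 8, "label": "NAME"},
            {"start": 10, "end": 15, "label": "NAME"},
        ],
    }
]

misaligned = list(find_misaligned(examples))
for text, span in misaligned:
    print(repr(text[span["start"]:span["end"]]), span)
```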
