Disappearing spans when using data-to-spacy

koaning · December 14, 2023, 2:23pm

As an aside, for extra context: all base English models in spaCy use the same tokeniser under the hood. So the tokens from nlp = spacy.blank("en") should be the same as those from spacy.load("en_core_web_sm"). These tokens are all determined by the same rule-based system.

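But yeah, it does sound like there's a mismatch between your span offsets and the token boundaries. One avenue to explore is to re-create the spans with spaCy's Doc.char_span method, which comes with an alignment_mode parameter that should allow you to wiggle around minor character issues. The default is "strict", which returns None when the offsets don't land on token boundaries, while "contract" and "expand" snap the span inwards or outwards to the nearest tokens. Beware that this is an automated method, which may also cause spans to be highlighted that weren't originally the plan. But it could help your current issue, if only as a temporary measure.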

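import srsly
import spacy

# A blank English pipeline only contains the tokeniser
nlp = spacy.blank("en")
# Take the first example from the exported data
ex = next(srsly.read_jsonl("debug.jsonl"))
doc = nlp(ex["text"])
# "expand" snaps the character offsets outwards to the nearest token boundaries
span = doc.char_span(228, 230, label="NAME", alignment_mode="expand")
print(span.text)
# Prints `Dr.`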
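
If you want to see which spans are at risk before converting, a rough sketch like the one below compares the default strict alignment with "expand". The example text, the offsets and the check_spans helper are made up for illustration. Only spacy.blank and Doc.char_span are actual spaCy calls, and the "spans" layout assumes Prodigy-style dictionaries with start, end and label keys.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical example; the text and offsets are invented for this sketch.
# The span covers "Dr" but stops before the ".", so it ends mid-token.
example = {
    "text": "Please ask Dr. Smith about it.",
    "spans": [{"start": 11, "end": 13, "label": "NAME"}],
}


def check_spans(nlp, example):
    """Compare strict and expanded alignment for every span in one example."""
    doc = nlp(example["text"])
    report = []
    for span in example.get("spans", []):
        start, end, label = span["start"], span["end"], span["label"]
        # Default alignment_mode is "strict": None if offsets miss token boundaries
        strict = doc.char_span(start, end, label=label)
        # "expand" grows the span to cover every token it partially overlaps
        expanded = doc.char_span(start, end, label=label, alignment_mode="expand")
        report.append(
            {
                "raw": example["text"][start:end],
                "strict": strict.text if strict is not None else None,
                "expand": expanded.text if expanded is not None else None,
            }
        )
    return report


for row in check_spans(nlp, example):
    print(row)
```

Any span where the strict result is None does not line up with the token boundaries, so it is a likely candidate for going missing. Treat that as something to check against your own data rather than a guaranteed diagnosis.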

Have you tried something like that?
