I have some special tokenisation rules and a trained sentence recognizer that I use in my production pipeline.
What happens if I annotate a spancat dataset with a standard spaCy model? Can I use my custom model during or after training, or will that just confuse the spancat component?
Is it possible (or even necessary) to retokenize the already annotated data with my special tokenisation rules?
Are your "sentence recognizer" and spancat datasets disjoint, i.e. are these completely different collections of documents?
If yes, then you can probably just train separate pipelines for these two tasks, each with its own tokenizer.
In general, there can be only one tokenizer in a spaCy pipeline. So if, conversely, you want to have both textcat (correct me if I'm wrong, but I assumed textcat when you said "sentence recognizer") and spancat in the same pipeline, then yes, both components would have to use the same tokenizer.
What will happen if you use your custom tokenizer to train on data annotated with another tokenizer?
If there is a mismatch between the annotated span boundaries and the tokens your custom tokenizer produces, those examples will be rejected. In other words, span start and end offsets must coincide with token start and end offsets. If that is not the case, the span will be rejected as invalid.
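To make the boundary rule concrete, here is a minimal pure-Python sketch. It does not use spaCy; the helper names `token_offsets` and `is_aligned` and the two tokenizations are hypothetical and only illustrate the idea that a span is valid when its start matches some token start and its end matches some token end.

```python
def token_offsets(tokens, text):
    """Return (start, end) character offsets for each token, in order."""
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def is_aligned(span_start, span_end, offsets):
    """A span is valid only if it starts at a token start and ends at a token end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return span_start in starts and span_end in ends


text = "a well-known author"
# Two hypothetical tokenizations of the same text
split_hyphen = ["a", "well", "-", "known", "author"]  # tokenizer used while annotating
keep_hyphen = ["a", "well-known", "author"]           # custom production tokenizer

# The span "known author", annotated under the hyphen-splitting tokenizer
span_start = text.index("known")
span_end = len(text)

aligned_before = is_aligned(span_start, span_end, token_offsets(split_hyphen, text))
aligned_after = is_aligned(span_start, span_end, token_offsets(keep_hyphen, text))
print(aligned_before, aligned_after)  # True False
```

The same annotation is valid under one tokenization and invalid under the other, because "known" no longer starts a token once "well-known" is kept whole.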
You can check whether your span annotations are misaligned with respect to your custom tokenizer like so (it is slightly easier if you first convert your spancat dataset to the .spacy format, e.g. with Prodigy's data-to-spacy command):
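import spacy
from spacy.tokens import DocBin

nlp = spacy.load("/path/to/model_with_custom_tokenizer")
doc_bin = DocBin().from_disk("train.spacy")  # spancat dataset
for doc in doc_bin.get_docs(nlp.vocab):
    # Re-tokenize the raw text with the custom tokenizer only
    retok_doc = nlp.make_doc(doc.text)
    # doc.spans maps each span-group key to its group of annotated spans
    for key, span_group in doc.spans.items():
        for annot_span in span_group:
            span = retok_doc.char_span(annot_span.start_char, annot_span.end_char)
            if span is None:  # span is invalid due to misalignment
                # Print the original annotation; `span` itself is None here
                print("misaligned:", annot_span.text, "--", doc.text)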
If there are just a few such cases, that is fine. If most of your span examples are rejected, you need to fix the misalignment.
Since you will need the same tokenization in production, it's probably best to retokenize the spancat dataset and adjust the span boundaries so that they align with the new tokenization. You would have to write your own alignment script for that. Also, see spaCy's Alignment utility in case it's helpful while working on the alignment script.
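As a starting point for such a script, here is a rough sketch that relies on the `alignment_mode` argument of `Doc.char_span`, which can snap character offsets outward ("expand") or inward ("contract") to the nearest token boundaries. The `realign` helper, the example label and the hand-built `Doc` (standing in for `nlp.make_doc(text)` with your custom tokenizer) are assumptions for illustration, not part of your pipeline. "sc" is the default spans key used by spancat; use your own key if it differs.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def realign(span_offsets, retok_doc, mode="expand"):
    """Snap (start_char, end_char, label) triples to the token boundaries of retok_doc."""
    aligned, dropped = [], []
    for start, end, label in span_offsets:
        span = retok_doc.char_span(start, end, label=label, alignment_mode=mode)
        if span is None:
            dropped.append((start, end, label))  # keep for manual review
        else:
            aligned.append(span)
    return aligned, dropped


# Stand-in for nlp.make_doc(text): a tokenization that keeps "well-known" whole
retok_doc = Doc(Vocab(), words=["a", "well-known", "author"], spaces=[True, True, False])

# "known author" (characters 7-19) was annotated under a hyphen-splitting tokenizer
aligned, dropped = realign([(7, 19, "DESCRIPTION")], retok_doc)
retok_doc.spans["sc"] = aligned
print([(s.text, s.label_) for s in retok_doc.spans["sc"]], dropped)
```

Expanding or contracting changes what the annotation covers (here "known author" becomes "well-known author"), so it is worth reviewing the adjusted spans rather than trusting them blindly.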