Hi @toadle,
If the NER architecture has worked well so far, I definitely wouldn't try to solve the tokenization issue by switching to the spancat architecture.
The issue is about data preprocessing, not really the modelling technique, and that's where we should address it.
The usual solution here would be to record some of these examples and see if you can fix the tokenization with rules. The best way to implement it would be to modify the default spaCy tokenizer by adding your custom rules, so that you can easily integrate it in a spaCy pipeline for annotation, training and production alike.
It does require learning a bit more about customizing spaCy components, but the documentation is excellent.
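As a minimal sketch of what such a rule-based fix can look like, here's how you'd extend the tokenizer's infix patterns. The `&`-splitting rule is purely a hypothetical example; you'd derive the actual patterns from the mistokenized examples you record:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Hypothetical example rule: also split tokens on "&" between letters,
# e.g. "foo&bar" -> "foo", "&", "bar". Replace this with the patterns
# your data actually needs.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])&(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("foo&bar")])  # ['foo', '&', 'bar']
```

Since the rules live on the tokenizer itself, the same pipeline produces identical tokenization at annotation time, training time and in production.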
Prodigy's `ner.manual` has a character highlighting mode that you can switch on and off from the UI. This allows you to highlight subparts of a token, but it doesn't affect the tokenization, so you'd end up with spans that are misaligned with tokens, and these would be rejected as training examples.
The character-level highlighting is meant for models that predict character-based tags rather than token-based tags, but you could use it to "record" the mistokenized examples and then use this record to write your custom tokenization rules.
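If I remember correctly, the mode is switched on with the `--highlight-chars` flag; the dataset name, model and source file below are placeholders:

```
prodigy ner.manual ner_dataset blank:en ./examples.jsonl --label PER,ORG --highlight-chars
```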
The easiest way to check if your data contains misaligned span annotations is to convert each Prodigy-annotated example with its tokens and spans to a spaCy `Doc`. If a span does not align with the tokens, `Doc.char_span` returns `None`, and you can check for that in a simple Python script.
So once you've done your annotation in Prodigy, you could process your data with a script similar to the one below to fish out the misaligned examples and try to fix them with the custom tokenizer:
```python
from spacy.language import Language
from spacy.tokens import Doc


def prodigy2spacy_ner(task: dict, nlp: Language) -> None:
    """Report spans in a Prodigy task that don't align with its tokens."""
    task_hash = task.get("_task_hash")
    # Rebuild the doc from the tokenization stored in the Prodigy task.
    tokens = task.get("tokens", [])
    words = [token["text"] for token in tokens]
    spaces = [token["ws"] for token in tokens]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    for span in task.get("spans", []):
        # char_span returns None if the character offsets don't map
        # cleanly onto token boundaries.
        spacy_span = doc.char_span(span["start"], span["end"], span["label"])
        if spacy_span is None:
            print(f"Misaligned span detected in example with task hash {task_hash}")
            print(f"Span: {span}")
            print()
```
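For example, assuming your annotations live in a dataset called `ner_dataset` (a placeholder), you could run it over all examples via Prodigy's database API:

```python
import spacy
from prodigy.components.db import connect

nlp = spacy.blank("en")  # only the shared vocab is needed here
db = connect()
for task in db.get_dataset("ner_dataset"):  # "ner_dataset" is a placeholder
    prodigy2spacy_ner(task, nlp)
```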
Please see the related posts on dealing with similar "agglutinations" of words: