Annotating strings without correct separation

Hey everybody,

I'm annotating examples from bank turnovers in Prodigy, which so far has worked very well.

Sadly, I sometimes come across examples that look like this:

...Kundennummer2785708...
...Hausmacherstr. 34Erstattung Stromkosten...
...Vertragsnummer 12345Zinsen 1.234,56Tilgung 456,78...

The relevant info for my NER here is often the numbers. But I noticed that, when annotating in Prodigy, it is often not possible to label numbers or strings that are run together without spaces, like Kundennummer2785708, even though that would be the correct labeling.

I suppose this is because of the tokenization?
Is there any good way to solve this, e.g. switching to span prediction? Are there any drawbacks, since NER has worked great so far?

Thanks!

Hi @toadle,

If the NER architecture has worked well so far, I definitely wouldn't try to solve the tokenization issue by switching to the spancat architecture.
The issue is really about data preprocessing, not the modelling technique, and that's where we should address it.
The usual solution here would be to record some of these examples and see if you can fix the tokenization with rules. The best way to implement this is to modify the default spaCy tokenizer by adding your custom rules, so that you can easily integrate it into a spaCy pipeline for annotation, training, and production alike.
It does require learning a bit more about customizing spaCy components, but the documentation is excellent.
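
For example, the splits in your examples happen at letter/digit boundaries, so a first attempt could be to add infix patterns for exactly that boundary. This is only a minimal sketch: the pipeline name and the character classes are assumptions you'd want to adapt to your data.

import spacy
from spacy.util import compile_infix_regex

# Minimal sketch: add infix rules that split a token at the boundary between
# letters and digits, e.g. "Kundennummer2785708" -> "Kundennummer", "2785708".
# "de_core_news_sm" is just an example; use the pipeline you actually train with.
nlp = spacy.load("de_core_news_sm")
custom_infixes = list(nlp.Defaults.infixes) + [
    r"(?<=[A-Za-zÄÖÜäöüß])(?=\d)",  # letter followed by digit
    r"(?<=\d)(?=[A-Za-zÄÖÜäöüß])",  # digit followed by letter
]
nlp.tokenizer.infix_finditer = compile_infix_regex(custom_infixes).finditer

doc = nlp("Hausmacherstr. 34Erstattung Stromkosten")
print([t.text for t in doc])

Once the splits look right on a sample of your data, you can save the modified pipeline with nlp.to_disk and point both your Prodigy recipes and your training config at it, so annotation, training and production all share the same tokenization.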

Prodigy's ner.manual has a character highlighting mode that you can switch on and off from the UI. This would allow you to highlight subparts of a token, but it wouldn't affect the tokenization, so you'd end up with spans that are misaligned with the tokens, and these would be rejected as training examples.
The character-level highlighting is meant for models that predict character-based tags rather than token-based tags, but you could use it to "record" the mistokenized examples and then use this record to write your custom tokenization rules.

The easiest way to check whether your data contains misaligned span annotations is to convert a Prodigy-annotated example with its tokens and spans to a spaCy Doc.
If a span does not align with the tokens, Doc.char_span returns None, and you can check for that in a simple Python script.
So once you've done your annotation in Prodigy, you could process your data with a script similar to the one below to fish out the misaligned examples and try to fix them with the custom tokenizer:

from spacy.language import Language
from spacy.tokens import Doc


def prodigy2spacy_ner(task: dict, nlp: Language) -> None:
    """Print any span in a Prodigy task that doesn't align with its tokens."""
    task_hash = task.get("_task_hash")
    tokens = task.get("tokens", [])
    words = [token["text"] for token in tokens]
    spaces = [token["ws"] for token in tokens]
    # Rebuild the Doc exactly as it was tokenized for annotation
    doc = Doc(nlp.vocab, words=words, spaces=spaces)

    for span in task.get("spans", []):
        # char_span returns None if the character offsets don't map onto token boundaries
        spacy_span = doc.char_span(span["start"], span["end"], label=span["label"])
        if spacy_span is None:
            print(f"Misaligned span detected in example with task hash {task_hash}")
            print(f"Span: {span}")
            print()

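To run the check over a whole dataset, here's a small usage sketch: the file name is just a placeholder for a prodigy db-out export, and a blank German pipeline is enough since only the vocab is needed.

import spacy
import srsly

nlp = spacy.blank("de")
for task in srsly.read_jsonl("annotations.jsonl"):  # placeholder path for your db-out export
    prodigy2spacy_ner(task, nlp)
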
Please see the related posts on dealing with similar "agglutinations" of words: