I am training a German NER model (spaCy 2.3.7) on law texts in Prodigy (ner.correct recipe). The input sometimes contains unit separators that split longer German words at the end of a line (e.g. "Ausla-gen"). This is also how the model will receive its input in production. The tokenizer treats the unit separator as its own token (e.g. "Ausla", "\x1f", "gen"). As I understand it, this makes learning harder for the model, since "Auslagen" and "Ausla-gen" will be handled completely differently, and "Ausla-gen" will naturally occur only a handful of times. Do you have any advice or good practice for handling this? Is there a tokenizer fix that lets the model ignore the unit separators? Of course I could simply filter out the unit separators, but if feasible I would like to leave the text structure intact so I can easily map predicted entities back to the original text.
Thanks in advance!
I think this is a case where it makes sense to apply some text normalization as a preprocessing step. There's a function for this in textacy, although I haven't used it myself: see "Text Preprocessing" in the textacy 0.11.0 documentation.
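If you want to normalize the text but still map predicted entities back to the original, one option is to strip the separators yourself and keep an index map. Here is a minimal sketch in plain Python, assuming the separator is the single character "\x1f". The helper names `strip_with_offsets` and `to_original_span` are made up for illustration and are not part of spaCy, Prodigy, or textacy.

```python
UNIT_SEPARATOR = "\x1f"  # assumption: the separator is this single character


def strip_with_offsets(text, drop=UNIT_SEPARATOR):
    """Remove every `drop` character from `text`.

    Returns the cleaned text plus a list that maps each character index
    in the cleaned text to its index in the original text.
    """
    kept = [(i, ch) for i, ch in enumerate(text) if ch != drop]
    cleaned = "".join(ch for _, ch in kept)
    index_map = [i for i, _ in kept]
    return cleaned, index_map


def to_original_span(start, end, index_map):
    """Map a half-open [start, end) span in the cleaned text to the original text."""
    if not 0 <= start < end <= len(index_map):
        raise ValueError("span is empty or outside the cleaned text")
    # Use the last covered character so separators inside the span are included.
    return index_map[start], index_map[end - 1] + 1


original = "Die Ausla\x1fgen werden erstattet."
cleaned, index_map = strip_with_offsets(original)

# Stand-in for a predicted entity: character offsets in the cleaned text.
start = cleaned.index("Auslagen")
end = start + len("Auslagen")

orig_start, orig_end = to_original_span(start, end, index_map)
print(repr(original[orig_start:orig_end]))
```

You would run the model on `cleaned` and pass each entity's character offsets (`ent.start_char` and `ent.end_char` in spaCy) through `to_original_span` to get the span in the original text, with the separator still inside it.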
Thanks a lot for the hint! I will definitely have a look at textacy.