Good practice for unit separators and training

Balo · September 8, 2021, 6:25am

Hi there,
I am training a German NER model (spaCy 2.3.7) with law texts as input in Prodigy (ner.correct recipe). The input sometimes contains unit separators that divide longer German words at the end of a line (e.g. "Ausla-gen"). This is also how the model will receive its input in production. The tokenizer treats the unit separator as its own token (e.g. "Ausla","\x1f","gen"). As I understand it, this makes learning harder for the model, since "Auslagen" and "Ausla-gen" will be handled completely differently, and "Ausla-gen" will naturally occur only a handful of times. Do you have any advice or good practice for handling this? Is there a tokenizer fix that lets the model ignore the unit separators? Of course I could simply filter out the unit separators. However, if feasible, I would like to leave the text structure intact, so that I can easily map predicted entities back to the original text.
Thanks in advance!

honnibal · September 14, 2021, 5:24pm

I think this is a case where it makes sense to apply some text normalization as a preprocessing step. There's a function for this in textacy, although I haven't used it myself: see the "Text Preprocessing" page of the textacy 0.11.0 documentation.
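
For readers who want to normalize the text and still map predicted entities back to the original, here is a minimal sketch in plain Python (it does not use textacy or spaCy). It assumes the separator is the single character "\x1f" with no accompanying hyphen or line break. The names `strip_separators` and `map_span_back` are hypothetical helpers made up for this illustration, not part of any library.

```python
UNIT_SEPARATOR = "\x1f"  # assumption: the only character that needs removing


def strip_separators(text, sep=UNIT_SEPARATOR):
    """Remove every `sep` character from `text`.

    Returns (clean_text, offsets), where offsets[i] is the index in the
    original `text` of the character clean_text[i].
    """
    clean_chars = []
    offsets = []
    for index, char in enumerate(text):
        if char != sep:
            clean_chars.append(char)
            offsets.append(index)
    return "".join(clean_chars), offsets


def map_span_back(start, end, offsets):
    """Map a non-empty span [start, end) in the cleaned text to the original text."""
    if not 0 <= start < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the cleaned text")
    # The end offset is derived from the last character inside the span, so
    # separators that sit inside the entity are included in the original span.
    return offsets[start], offsets[end - 1] + 1


original = "Die Ausla\x1fgen werden erstattet."
clean, offsets = strip_separators(original)

# Stand-in for a model prediction: character offsets of an entity in `clean`.
start = clean.index("Auslagen")
end = start + len("Auslagen")

orig_start, orig_end = map_span_back(start, end, offsets)
recovered = original[orig_start:orig_end]  # still contains the separator
```

The model would be trained and run on `clean`, so "Auslagen" is always seen as one word, while the offset list lets character-based entity spans be translated back onto the untouched source text.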

Balo · September 15, 2021, 8:10am

Thanks a lot for the hint! I will definitely have a look at textacy.
