Sentence segmentation in NER.teach

Hi, I'm trying to train up to 10 different NER labels on German law-related documents. I started with a blank German model and a pipeline containing the sentencizer and NER. Due to German abbreviations (like "Art.") and the German date format (like "9. März 2020"), the sentencizer performs rather poorly, which becomes especially visible when using ner.teach. As far as I understand, NER takes IS_SENT_START into account (discussed here).
Am I right to be worried that rather poor sentence splitting might negatively influence NER?

If your model does segment sentences (e.g. via the parser or sentencizer), then that's possible, yes. The entity recognizer won't try to predict entities across sentence boundaries (which is typically good, because it eliminates many likely incorrect candidates). So it's good this came up during annotation, because it lets you resolve the underlying problem.

In ner.teach, you can set --unsegmented to disable sentence segmentation during annotation. Just make sure that the texts you feed in aren't too long – otherwise it'll make the process too slow because the beam search (to create the different possible analyses) takes too long.
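To keep unsegmented texts short enough, one option is to pre-split the input yourself before feeding it in. Here's a minimal sketch in plain Python, not anything built into Prodigy: the function name split_into_fragments and the 1000-character default are made up for illustration, and it assumes paragraphs in your source are separated by blank lines.

```python
def split_into_fragments(text, max_chars=1000):
    """Split text into fragments of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible.
    max_chars=1000 is an arbitrary illustrative default, not a Prodigy setting.
    """
    fragments = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A single paragraph that is too long gets cut at the last space
        # before the limit (or hard-cut if it contains no space at all).
        while len(para) > max_chars:
            cut = para.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            if current:
                fragments.append(current)
                current = ""
            fragments.append(para[:cut].strip())
            para = para[cut:].strip()
        if not para:
            continue
        # Append the paragraph to the current fragment if it still fits.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            fragments.append(current)
            current = para
    if current:
        fragments.append(current)
    return fragments
```

Each fragment can then be written out as one {"text": ...} line of a JSONL file. Keep in mind that a cut inside a long paragraph can still fall in the middle of an entity, so it's worth spot-checking the fragment boundaries.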

At runtime, you then also want to disable sentence segmentation (either by removing the sentencizer or by adding a custom component first in the pipeline that sets token.is_sent_start = False on all tokens except the first). Of course you could also use your own sentence segmentation logic that does better and considers the special cases you're dealing with. I know English and German legal texts are pretty different, but maybe you can still adapt some of the ideas from blackstone's sentence segmenter.
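The custom component could look roughly like this. It's only a sketch: the names disable_sentence_boundaries and add_to_pipeline are made up, and how you register a component depends on your spaCy version, so the helper below branches on it (v2 takes the function directly, v3 expects a registered name).

```python
def disable_sentence_boundaries(doc):
    """Treat the whole doc as one sentence.

    Only the first token is marked as a sentence start; every other
    token is explicitly marked as not starting a sentence.
    """
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc


def add_to_pipeline(nlp, name="disable_sentence_boundaries"):
    """Hypothetical helper: add the component first in the pipeline."""
    import spacy
    from spacy.language import Language

    if spacy.__version__.startswith("2."):
        # spaCy v2: add_pipe accepts the callable itself.
        nlp.add_pipe(disable_sentence_boundaries, name=name, first=True)
    else:
        # spaCy v3 and later: register the function under a name first.
        Language.component(name, func=disable_sentence_boundaries)
        nlp.add_pipe(name, first=True)
    return nlp
```

With a blank German model you'd call add_to_pipeline(nlp) once after creating the pipeline, so the boundaries are already set by the time the entity recognizer runs.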


Thank you very much for the incredibly quick and comprehensive answer!
For now, I will remove any sentence segmentation from the model and work with the --unsegmented flag during annotation. (I can manage to cut the feed input into smaller fragments.)
But I will definitely also have a look at the blackstone solution. Thx again!
