Sentence segmentation in NER.teach

Balo · March 9, 2020, 4:02pm

Hi, I'm trying to train up to 10 different NER labels on German law-related documents. I started with a blank German model and a pipeline containing the sentencizer and NER. Because of German abbreviations (like "Art.") and the German date format (like "9. März 2020"), the sentencizer performs rather poorly, which becomes especially visible when using ner.teach. As far as I understand, NER takes IS_SENT_START into account (discussed here).
Am I right to be worried that poor sentence splitting might negatively influence NER?

ines · March 9, 2020, 7:39pm

If your model does segment sentences (e.g. via the parser or sentencizer), then that's possible, yes. The entity recognizer won't try to predict entities across sentence boundaries (which is typically good, because it eliminates many likely incorrect candidates). So it's good this came up during annotation, because it lets you resolve the underlying problem.

In ner.teach, you can set --unsegmented to disable sentence segmentation during annotation. Just make sure that the texts you feed in aren't too long – otherwise it'll make the process too slow because the beam search (to create the different possible analyses) takes too long.

At runtime, you then also want to disable sentence segmentation (either by removing the sentencizer or adding a custom component first in the pipeline that sets all tokens to token.is_sent_start = False, except the first). Of course you could also use your own sentence segmentation logic that does better and considers the special cases you're dealing with. I know English and German legal texts are pretty different, but maybe you can still adapt some of the ideas from blackstone's sentence segmenter.
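
A minimal sketch of such a custom component, assuming no parser or sentencizer has already set boundaries on the doc. The function name `disable_sentence_boundaries` is made up for illustration, and the `SimpleNamespace` tokens are only stand-ins so the snippet runs without spaCy installed; with a real pipeline the function would receive a spaCy `Doc`.

```python
from types import SimpleNamespace


def disable_sentence_boundaries(doc):
    """Mark only the first token as a sentence start, so the whole
    text is treated as a single sentence by later components."""
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc


# Stand-in tokens (an assumption for this demo, not spaCy objects).
tokens = [SimpleNamespace(text=t, is_sent_start=None) for t in ["Art.", "5", "GG"]]
disable_sentence_boundaries(tokens)
print([t.is_sent_start for t in tokens])  # [True, False, False]
```

In spaCy v2 the function can be added directly with `nlp.add_pipe(disable_sentence_boundaries, before="ner")`. In spaCy v3 it would first need to be registered with the `@Language.component("disable_sentence_boundaries")` decorator and then added by its string name.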

Balo · March 10, 2020, 7:06am

Thank you very much for the incredibly quick and comprehensive answer!
For now, I will remove any sentence segmentation from the model and work with the --unsegmented flag during annotation. (I can manage to cut the feed input into smaller fragments.)
But I will definitely also have a look at the blackstone solution. Thx again!
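
Cutting the input into smaller fragments could look roughly like the sketch below. It assumes that paragraphs are separated by blank lines and that the fragments go into a JSONL file with a "text" key per line. The helper names and the 500-character limit are made up for illustration, not taken from this thread.

```python
import json


def split_into_fragments(text, max_chars=500):
    """Pack blank-line-separated paragraphs into fragments of at most
    max_chars characters. A single paragraph longer than max_chars is
    kept whole rather than cut mid-sentence."""
    fragments, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: start a new fragment.
            fragments.append(current)
            current = para
        else:
            current = candidate
    if current:
        fragments.append(current)
    return fragments


def to_jsonl(fragments):
    """Serialize fragments as one JSON object per line, keeping umlauts readable."""
    return "\n".join(json.dumps({"text": f}, ensure_ascii=False) for f in fragments)
```

The resulting lines can be written to a file and used as the source for ner.teach together with --unsegmented.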
