Custom sentence segmenter with Prodigy 1.11 / spaCy 3.x

Hi

To train a model to recognize labels (NER) in German legal texts, I implemented a custom sentence segmenter (based on the example from Blackstone: https://github.com/ICLRandD/Blackstone#sentence-segmenter).
Since I am currently moving to spaCy 3.x, I registered the sentence segmenter as a language factory in the config file.
When trying to train, there seems to be some interference between the tok2vec component and my implementation of the segmenter (see screenshot).

Could you please give me a hint as to what might be causing the error?

And more importantly: is this still the right way to use a custom sentence segmenter? I understand that we now have the possibility to train one as a pipeline component. Is there any guidance or are there any examples available on how to do that?
Thank you very much!

Hi, this looks like a problem with the sentence segmentation algorithm itself, which might not correctly check whether doc[index + 1] is past the end of the doc. Maybe you're iterating over the whole doc rather than, say, doc[:-1]?
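
For illustration, here's a minimal sketch of a segmenter component that keeps the lookahead in bounds (the component name and the semicolon rule are placeholders, not your actual logic):

```python
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    # Iterate over doc[:-1] so that doc[i + 1] always exists; looping
    # over the full doc would index past the end on the last token.
    for i, token in enumerate(doc[:-1]):
        # Placeholder rule: start a new sentence after a semicolon.
        doc[i + 1].is_sent_start = token.text == ";"
    return doc

nlp = spacy.blank("de")
nlp.add_pipe("custom_sentencizer", first=True)
doc = nlp("Erster Satz; zweiter Satz.")
print([sent.text for sent in doc.sents])  # ['Erster Satz;', 'zweiter Satz.']
```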

It doesn't immediately look like a problem with Prodigy or spaCy.

Thanks, @adriane. You are absolutely right! Sorry about that; I don't know why I missed it...

Is there any guidance on how to train a sentence segmenter component in Prodigy?

The basic recipes are here: https://prodi.gy/docs/recipes#sent

You should be able to train in Prodigy using `prodigy train --senter dataset`. And you can always export the data with `data-to-spacy` and then train a pipeline with a senter component in spaCy directly.
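
For example, with a hypothetical dataset called sent_data and placeholder output paths:

```bash
# Train a senter component directly in Prodigy
prodigy train ./model_output --senter sent_data

# Or export the annotations and train with spaCy directly
prodigy data-to-spacy ./corpus_output --senter sent_data
```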

The only tricky thing might be if you're trying to extend an existing pretrained pipeline like en_core_web_sm that contains a senter component that's disabled by default. If you want to start from en_core_web_sm, load it with everything except senter excluded, enable senter with nlp.enable_pipe, and then save that model to a local directory to use as the input model with Prodigy.
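
Something along these lines (a sketch: the excluded components reflect the stock en_core_web_sm pipeline, and the output path is just an example):

```python
import spacy

# Load en_core_web_sm without the components we don't need, keeping
# tok2vec in case senter listens to it; senter itself is loaded but
# disabled by default in the pretrained pipeline.
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "parser", "ner", "attribute_ruler", "lemmatizer"],
)
nlp.enable_pipe("senter")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'senter']

# Save to a local directory to use as the input model with Prodigy.
nlp.to_disk("./en_senter_base")
```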