To train a model to recognize labels (NER) in German law texts, I implemented a custom sentence segmenter, based on the example from Blackstone (https://github.com/ICLRandD/Blackstone#sentence-segmenter).
Since I am currently moving to spaCy 3.x, I registered the sentence segmenter as a language factory in the config file.
When I try to train, there seems to be an interference between the tok2vec component and my implementation of the segmenter (see screenshot).
Could you please give me a hint as to what might be causing the error?
And more importantly: is this still the right way to use a custom sentence segmenter? I understand that we now have the possibility to train it as a model component. Is there any guidance or are there examples available on how to do that?
Thank you very much!
Hi, this looks like a problem with the sentence segmentation algorithm itself: it might not be checking correctly whether doc[index+1] is past the end of the doc. Maybe you're iterating over the whole doc rather than doc[:-2] or something like that?
It doesn't immediately look like a problem with Prodigy or spaCy.
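To illustrate the bounds issue, here is a minimal sketch of a rule-based segmenter registered as a spaCy 3 component. The component name custom_sentence_boundaries and the semicolon rule are hypothetical placeholders, not the Blackstone rules or your actual code. The point is only that the loop stops before the last token, so doc[token.i + 1] always exists.

```python
import spacy
from spacy.language import Language


# Hypothetical component name and rule, for illustration only.
@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    # Iterate over doc[:-1] so that doc[token.i + 1] never runs past the end.
    for token in doc[:-1]:
        if token.text == ";":
            # Assumed rule: a new sentence starts after a semicolon.
            doc[token.i + 1].is_sent_start = True
    return doc


nlp = spacy.blank("de")
nlp.add_pipe("custom_sentence_boundaries")

# The text ends with ";" to exercise the end-of-doc case.
doc = nlp("Der Kläger hat Recht; die Beklagte zahlt;")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

If the loop ran over the whole doc instead, the final ";" would trigger a lookup one position past the last token and raise an IndexError.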
As for training the segmenter as a model component: you should be able to train in Prodigy using prodigy train --senter dataset. You can also always export the data with data-to-spacy and then train a pipeline with a senter component in spaCy directly.
The only tricky part might be if you're trying to extend an existing pretrained pipeline like en_core_web_sm, which contains a senter component that's disabled by default. If you want to start from en_core_web_sm, you should load en_core_web_sm with everything except senter excluded, enable senter with nlp.enable_pipe, and then save that model to a local directory to use as the input model with Prodigy.
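The steps for en_core_web_sm could be sketched roughly as follows. The helper names names_to_exclude and save_senter_only are hypothetical, and loading the pipeline once just to read its component names is one possible approach (you could also write the exclude list by hand). It assumes the pipeline package is installed.

```python
import spacy


def names_to_exclude(component_names, keep="senter"):
    """Return every component name except the one to keep."""
    return [name for name in component_names if name != keep]


def save_senter_only(model_name, output_dir):
    """Save a copy of a pipeline that contains only an enabled senter."""
    # First load: only used to read all component names, including disabled ones.
    full_nlp = spacy.load(model_name)
    exclude = names_to_exclude(full_nlp.component_names)

    # Second load: everything except senter is excluded.
    nlp = spacy.load(model_name, exclude=exclude)

    # senter is disabled by default in this kind of pipeline, so enable it.
    if "senter" in nlp.disabled:
        nlp.enable_pipe("senter")

    nlp.to_disk(output_dir)
    return nlp
```

Calling save_senter_only("en_core_web_sm", "./en_senter_only") (the output path is just an example) should leave a local directory that can be passed as the input model.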