Sentencizer configuration questions

nlpfan · November 7, 2022, 12:19am

This is my second question for the day. Please pardon me.

I have input text that is already in one-sentence-per-line format. I would like to disable the parser's sentence boundary detection (which is the subject of a sister topic) and use the spaCy Sentencizer to create sentences from newlines.

As per the documentation, the Sentencizer takes a list of punctuation characters (punct_chars) that can be used as sentence boundaries. Hence, I am setting punct_chars = ['\n'], and it seems to work in the simple tests I have performed.
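For reference, here is a minimal sketch of the setup I am describing. It assumes spaCy v3 and a blank English pipeline (so there is no parser setting boundaries), and the sample text is made up. One caveat I believe applies: the tokenizer groups consecutive whitespace into a single token, so a blank line would show up as '\n\n' and would not match a punct_chars list that only contains '\n'.

```python
import spacy

# Blank English pipeline: tokenizer only, no parser or senter.
nlp = spacy.blank("en")

# Treat a newline token as the only sentence-final "punctuation".
nlp.add_pipe("sentencizer", config={"punct_chars": ["\n"]})

# Made-up sample: one "sentence" per line, single newlines only.
text = "First line\nSecond line\nThird line"
doc = nlp(text)

# Each sentence span ends with its newline token, so strip it for display.
sents = [sent.text.strip() for sent in doc.sents]
for s in sents:
    print(repr(s))
```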

What I have not been able to figure out is the purpose of the overwrite flag. I have left it at the default of False, but when should this flag be set to True?

Also, do I need to worry about the scorer? I am going with the default value. I would appreciate any explanation (or a pointer to relevant documentation) of the purpose of the scorer in the Sentencizer.

I would greatly appreciate any insights.

ryanwesslen · November 7, 2022, 5:44pm

Hi @nlpfan!

No problem on the question.

It seems like your question is about spaCy's sentence components, not Prodigy. Could you repost in spaCy's GitHub discussions?

We try to keep spaCy-specific posts there. For example, this one describes more about overwrite:

They can give you more background on scorer's non-default settings.

nlpfan · February 19, 2023, 4:25pm

@ryanwesslen, I appreciate the suggestions above and the pointer to the spaCy GitHub discussions. I decided to go with the blank:en model and the ner.manual recipe, and that has worked fine for me so far. The text I have contains a lot of partial sentences (it comes from cells in tables in a PDF), which I pre-process with a (simple) custom sentencizer, and blank:en respects the pre-existing "sentence" boundaries.
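In case it helps anyone else, the pre-processing amounts to something like the sketch below. This is not my exact code: the function names and the sample text are hypothetical, and it assumes that one non-empty line should become one task. The output is JSONL with a "text" key per line, which can then be loaded as the source for ner.manual with blank:en.

```python
import json


def lines_to_tasks(raw_text):
    """Turn one-sentence-per-line text into task dicts with a "text" key."""
    tasks = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines so no empty tasks are created
            tasks.append({"text": line})
    return tasks


def tasks_to_jsonl(tasks):
    """Serialize the tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task, ensure_ascii=False) for task in tasks)


# Made-up sample resembling table cells extracted from a PDF.
sample = "Total revenue\n\nNet income, adjusted \nYear ended 2021"
tasks = lines_to_tasks(sample)
jsonl = tasks_to_jsonl(tasks)
print(jsonl)
```

Because each line becomes its own task, nothing downstream needs to split sentences again.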
