Sentencizer configuration questions

This is my second question for the day. Please pardon me. :smiling_face:

I have input text that is already in one-sentence-per-line format. I would like to disable the parser's sentence boundary detection (which is the subject of a sister topic) and use the spaCy Sentencizer to create sentences from newlines.

As per the documentation, the Sentencizer takes a list of punctuation characters (punct_chars) that can be used as sentence boundaries. Hence, I am setting punct_chars = ['\n'], and it seems to work in the simple tests I have performed.
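
For reference, a minimal sketch of the setup described above, assuming spaCy v3 and a blank English pipeline (so no parser is present to set boundaries). The sample text and variable names are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no parser or other components.
nlp = spacy.blank("en")

# Treat a single newline token as the only sentence-ending character.
nlp.add_pipe("sentencizer", config={"punct_chars": ["\n"]})

# Illustrative input: one "sentence" per line.
text = "Revenue increased in 2021\nOperating costs fell\nNet margin improved"
doc = nlp(text)

# The newline token stays attached to the end of each sentence, so strip it.
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```

One caveat, as far as I understand the tokenizer: a run of consecutive newlines becomes a single token such as "\n\n", which would not match the one-character entry in punct_chars, so blank lines in the input may need their own entry or some pre-processing.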

What I have not been able to figure out is the purpose of the overwrite flag. I have left it at its default of False, but when should this flag be set to True?

Also, do I need to worry about the scorer? I am going with the default value. I would appreciate any explanation (or a pointer to relevant documentation) of the purpose of the scorer in the Sentencizer.

I would greatly appreciate any insights.

Hi @nlpfan!

No problem on the question :slight_smile:

It seems like your question is about spaCy's sentence components, not Prodigy. Could you repost it in spaCy's GitHub Discussions?

We try to keep spaCy-specific posts there. For example, this one describes more about overwrite:

The folks there can give you more background on the scorer's non-default settings.
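
In the meantime, here is a rough sketch of what the overwrite flag appears to control, assuming a spaCy v3 release in which the sentencizer accepts an overwrite setting. The helper sentence_starts and the sample text are hypothetical. The idea is to mark a sentence start by hand, as an earlier pipeline component might, and then check whether the sentencizer keeps or replaces that annotation.

```python
import spacy


def sentence_starts(overwrite):
    """Return the tokens marked as sentence starts (hypothetical helper)."""
    nlp = spacy.blank("en")
    sentencizer = nlp.add_pipe(
        "sentencizer",
        config={"punct_chars": ["\n"], "overwrite": overwrite},
    )
    # Tokenize only, so the boundary below is set before the sentencizer runs.
    doc = nlp.make_doc("alpha beta\ngamma delta")
    # Pretend an earlier component already marked "beta" as a sentence start.
    doc[1].is_sent_start = True
    # Run the sentencizer directly on the pre-annotated Doc.
    doc = sentencizer(doc)
    return [token.text for token in doc if token.is_sent_start]


kept = sentence_starts(overwrite=False)
replaced = sentence_starts(overwrite=True)
print("overwrite=False:", kept)
print("overwrite=True: ", replaced)
```

With overwrite=False the hand-set boundary on "beta" should survive alongside the newline-based one; with overwrite=True the sentencizer's own decisions replace it. So the flag mostly matters when another component (or your own code) sets sentence boundaries before the sentencizer runs.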

@ryanwesslen, I appreciate the suggestions above and the pointer to the spaCy GitHub discussions. I decided to go with the blank:en model with the ner.manual recipe, and that has worked fine for me so far. The text I have has a lot of partial sentences (it comes from cells in tables in a PDF) that I am pre-processing with a (simple) custom sentencizer, and blank:en respects the pre-existing "sentence" boundaries.
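
In case it helps anyone else, one way such a simple custom sentencizer could look as a spaCy v3 component is sketched below. This is an illustration under assumptions, not necessarily the exact pre-processing used here: the component name newline_sentencizer and the sample text are made up, and it starts a new sentence after any whitespace token that contains a newline.

```python
import spacy
from spacy.language import Language


@Language.component("newline_sentencizer")  # hypothetical component name
def newline_sentencizer(doc):
    # The first token always opens a sentence.
    start_next = True
    for token in doc:
        if token.is_space and "\n" in token.text:
            # Newline tokens (including runs like "\n\n") close the current line.
            token.is_sent_start = False
            start_next = True
        else:
            token.is_sent_start = start_next
            start_next = False
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("newline_sentencizer")

# Illustrative table-cell text with a blank line between two entries.
cells = "Total assets\n\nNet income 2021\nNot available"
doc = nlp(cells)
lines = [sent.text.strip() for sent in doc.sents]
print(lines)
```

Unlike a punct_chars list with a single "\n" entry, this version also treats runs of several newlines as one boundary, which can be handy for text extracted from PDF tables.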