senter shows white space as Sentence Starting

thalish · December 17, 2021, 8:16pm

I was trying to create an annotation task in Prodigy for a sentence recogniser. I noticed that when using senter.correct, a space is often highlighted as the sentence start token.

But when creating new annotations, I can never select a space as the sentence_start. What's weirder is that if I remove the existing sentence start highlight on a space, I can no longer select it again.

Would this eventually affect training a sentence recognizer model? Am I missing something here?

Example:

What Prodigy highlights:

(s) This is a new Sentence.

What I get when I try to highlight:

This(s) is a new Sentence.

ines · December 20, 2021, 11:02am

Hi! Are the whitespace characters newlines? If so, I think what's happening here is this: by default, Prodigy's manual span annotation interfaces will mark newline tokens as disabled because in most use cases (NER, spancat), you never want a span that includes newlines. But this is obviously not a good default in this case.

Could you try setting "allow_newline_highlight": true in your prodigy.json and see if that lets you highlight newline tokens?

In general, having a lot of newline tokens identified as sentence starts is definitely not ideal and can lead to worse results. So if this is common in your data, you could also consider adding a preprocessing step that normalises the whitespace and removes duplicate newlines before passing the text to spaCy. This means you'll likely need fewer custom examples to improve the sentence recognizer.
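A minimal sketch of what such a preprocessing step could look like, using only Python's standard library. The function name `normalize_whitespace` and the exact rules (collapsing runs of spaces and tabs, trimming spaces around newlines, merging repeated newlines) are assumptions for illustration, not something Prodigy or spaCy provides, so adjust them to your data:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs before the text is tokenized."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs and non-breaking spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Drop a single leftover space directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Merge repeated newlines into a single newline.
    text = re.sub(r"\n+", "\n", text)
    # Remove leading and trailing whitespace from the whole text.
    return text.strip()


if __name__ == "__main__":
    raw = "  This is a new Sentence.\n\n\n  Another   one. "
    print(repr(normalize_whitespace(raw)))
```

Since this changes character offsets, it would need to run on the raw text before you annotate, not on examples that already carry spans.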

thalish · December 20, 2021, 12:08pm

I don't think they are newlines, as newlines are rendered using the carriage return symbol in the UI. If I understand you correctly, you are saying that newlines/whitespace should ideally not be annotated as sentence starts?

ines · December 21, 2021, 10:37am

The data used to train the senter that ships with the spaCy models didn't include any whitespace tokens, so it actually shouldn't matter that much. The most important thing is that your data is annotated consistently: if you always annotate whitespace as sentence starts, it should work fine.
