I was trying to create an annotation task in Prodigy for a sentence recognizer. I noticed that when using senter.correct, there are many occasions when a space is highlighted as the sentence start token.
But when creating new annotations, I can never select a space as the sentence start. What's weirder is that if I remove an existing sentence start highlight on a space, I can no longer select it again.
Would this eventually affect training a sentence recognizer model? Am I missing something here?
What Prodigy highlights:
(s) This is a new Sentence.
What I get when I try to highlight:
This(s) is a new Sentence.
Hi! Are the whitespace characters newlines? If so, I think what's happening here is this: by default, Prodigy's manual span annotation interfaces will mark newline tokens as disabled because in most use cases (NER, spancat), you never want a span that includes newlines. But this is obviously not a good default in this case.
Could you try setting "allow_newline_highlight": true in your prodigy.json and see if that lets you highlight newline tokens?
In general, having a lot of newline tokens identified as sentence starts is definitely not ideal and can lead to worse results. So if this is common in your data, you could also consider adding a preprocessing step that normalises the whitespace and removes duplicate newlines before passing the text to spaCy. This means you'll likely need fewer custom examples to improve the sentence recognizer.
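For illustration, a preprocessing step like that could look roughly like the sketch below. The function name normalize_whitespace is made up for this example, and the exact rules (collapsing spaces and tabs, reducing repeated newlines to a single one) are assumptions you'd want to adjust to your data. It only uses Python's standard library.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: tidy whitespace before the text reaches spaCy."""
    # Unify line endings so "\r\n" and "\r" are handled like "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a stray space directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse repeated newlines into a single newline.
    text = re.sub(r"\n{2,}", "\n", text)
    # Remove leading and trailing whitespace.
    return text.strip()
```

You could then apply it to each example's text before it is tokenized, for instance while loading your source file. If blank lines carry meaning in your data (e.g. paragraph breaks), you may prefer to keep double newlines rather than collapsing them.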
I don't think they are newlines, as newlines are rendered using the carriage return symbol in the UI. If I understand you correctly, you are saying that newlines/whitespace should ideally not be annotated as sentence starts?
The data that the senter shipped with the spaCy models was trained on didn't include any whitespace tokens, so it actually shouldn't matter that much. The most important thing is that your data is annotated consistently. If you always annotate whitespace as sentence starts, it should work fine.