When I’m annotating named entities, Prodigy suggests a fair number of candidates (maybe 15%) that are just whitespace. This seems odd to me, but I figure it’s just the model making mistakes, so I mark them wrong. However, I just saw a string of candidates that looked like this:
It seems like the English models have a tendency to label \n as GPE – see this issue reported on the spaCy issue tracker. (Interestingly, in the last comment, a user notes that this also happened when training a model from scratch on their own data.)
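For what it’s worth, here’s a minimal way to check whether a given pipeline produces whitespace entities. The model name and example text are just placeholders, and whether this actually reproduces will depend on the model and version you have installed:

```python
import spacy

# Placeholder pipeline; any English model will do, and whether the bug
# reproduces depends on the model and version installed.
nlp = spacy.load("en_core_web_sm")

text = "The meeting is in Berlin. \n Prices rose in London \n and Paris."
doc = nlp(text)

for ent in doc.ents:
    if all(token.is_space for token in ent):
        # The entity consists entirely of whitespace tokens
        print(f"whitespace entity: {ent.label_} at chars {ent.start_char}-{ent.end_char}")
    else:
        print(f"{ent.text!r} -> {ent.label_}")
```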
So it’s definitely possible that the entity here is actually a space plus a newline character suggested by spaCy’s model. Visually, it’s a little unfortunate, since it causes the entity label to break onto the new line. Given how the web app’s NER rendering works, the highlighting itself is unlikely to be wrong, but you can annotate the task, save it and check it in your dataset, just to be sure.
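If you’d rather check the saved spans programmatically, a sketch like this should work. The dataset name is a placeholder, and the exact database API may differ slightly between Prodigy versions:

```python
from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("ner_news")  # placeholder dataset name

for eg in examples:
    for span in eg.get("spans", []):
        snippet = eg["text"][span["start"]:span["end"]]
        if snippet.strip() == "":
            # The highlighted span is pure whitespace
            print(f"{span['label']}: chars {span['start']}-{span['end']} = {snippet!r}")
```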
Btw, just had an idea for the front-end: In general, we do want Prodigy to always render whitespace as it comes in (as discussed here; your issue is actually a perfect example of why it matters). But how about a config option that shows visual indicators for whitespace characters? You know, like that setting in MS Word: a · for a space, a ⇥ for a tab, a ¶ for a newline and so on.
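To make the idea concrete, here’s a rough sketch of the kind of character mapping I mean. This is purely illustrative, not an existing Prodigy setting:

```python
# Purely illustrative mapping, mirroring the MS Word convention above.
# This is a sketch of the idea, not an existing Prodigy option.
WHITESPACE_INDICATORS = {
    " ": "\u00b7",    # · middle dot for a space
    "\t": "\u21e5",   # ⇥ rightwards arrow to bar for a tab
    "\n": "\u00b6\n", # ¶ pilcrow, keeping the actual line break
}

def show_whitespace(text: str) -> str:
    """Replace whitespace characters with visible indicators."""
    return "".join(WHITESPACE_INDICATORS.get(ch, ch) for ch in text)

print(show_whitespace("Berlin \n London"))
# Berlin·¶
# ·London
```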
I don’t think it’s bad for the entity label to break onto the next line. Regardless of which line the label appears on, the rendering communicates to the annotator that whitespace was hypothesized as a GPE, which is what matters.
Visual indicators for whitespace characters could be helpful. I would want it to be something that you could toggle on and off during an annotation session, the same way you can toggle them in a word processor. Most of the time you wouldn’t want to see the indicators (since they’re probably not displayed in the original document), but it would be helpful to quickly toggle them on in cases like this when you want to know why something looks weird.