Whitespace NER candidate or rendering bug?

wpm · January 5, 2018, 12:48am

When I’m annotating NERs, Prodigy suggests a fair number (maybe 15%) of candidates that are just whitespace. This seems odd to me, but I figure it’s just the model making mistakes, so I mark them wrong. However, I just saw a string of candidates that looked like this:

I’m wondering if Prodigy is actually correctly suggesting “Chile” as the candidate GPE here and just getting the highlighting wrong.

Does this look suspicious to you? Or is it normal for Prodigy to propose whitespace as a possible entity?

ines · January 5, 2018, 2:40am

It seems like the English models have a tendency to label \n as GPE – see this issue reported on the spaCy issue tracker. (Interestingly, in the last comment, a user notes that this also happened when training a model from scratch on their own data.)

So it’s definitely possible that the entity here is actually a space + newline character suggested by spaCy’s model. Visually, it’s a little unfortunate, since it causes the entity label to break onto the new line. Given the web app’s rendering algorithm for NER, it’s unlikely to get the highlighting wrong – but you can annotate the task, save it and check it out in your dataset, just to be sure.
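To make that check concrete, here is a minimal sketch of how you could inspect saved annotations for whitespace-only spans. It assumes the usual task format, with a "text" string and a "spans" list of character offsets ("start", "end", "label"), for example as exported with `prodigy db-out`. The `whitespace_spans` helper and the example record are hypothetical and only for illustration.

```python
def whitespace_spans(task):
    """Return (start, end, label, covered_text) for spans that cover only whitespace."""
    text = task["text"]
    found = []
    for span in task.get("spans", []):
        covered = text[span["start"]:span["end"]]
        # str.strip() removes spaces, tabs and newlines, so an empty
        # result means the span contains nothing but whitespace.
        if covered.strip() == "":
            found.append((span["start"], span["end"], span.get("label"), covered))
    return found


# Hypothetical record, shaped like one line of an exported dataset.
task = {
    "text": "He moved to Chile \nin 2015.",
    "spans": [{"start": 17, "end": 19, "label": "GPE"}],
    "answer": "reject",
}

for start, end, label, covered in whitespace_spans(task):
    # repr() makes the otherwise invisible characters readable.
    print(start, end, label, repr(covered))
```

If the span offsets point at " \n" rather than at "Chile", the highlighting in the web app was correct and the model really did propose whitespace.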

Btw, just had an idea for the front-end: In general, we do want Prodigy to always render whitespace as it comes in (as discussed here – your issue is actually a perfect example of why it matters). But how about a config option that will show visual indicators for whitespace characters? You know, like that setting in MS Word: a · for a space, ⇥ for a tab, ↵ for a newline and so on.
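As a rough sketch of the idea (not how the web app does or would implement it), the substitution could look something like this in Python. The symbol mapping, including ↵ for a newline, and the `show_whitespace` name are assumptions made for illustration.

```python
# Assumed mapping from whitespace characters to visible markers.
# The real newline is kept after its marker so line breaks still render.
WHITESPACE_SYMBOLS = {" ": "·", "\t": "⇥", "\n": "↵\n"}


def show_whitespace(text):
    """Return a copy of text with whitespace replaced by visible markers."""
    return "".join(WHITESPACE_SYMBOLS.get(ch, ch) for ch in text)


print(show_whitespace("Chile \nin\t2015"))
```

The original text stays untouched, so character offsets in the saved annotations are not affected. Only the displayed string changes.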

Edit: Proof of concept:

wpm · January 5, 2018, 4:25pm

I don’t think it’s bad for the entity label to break over onto the next line. Regardless of which line the label appears on, the rendering communicates to the annotator that whitespace was hypothesized as a GPE, which is what matters.

Visual indicators for whitespace characters could be helpful. I would want it to be something that you could toggle on and off during an annotation session, the same way you can toggle them in a word processor. Most of the time you wouldn’t want to see the indicators (since they’re probably not displayed in the original document), but it would be helpful to quickly toggle them on in cases like this when you want to know why something looks weird.
