Ambiguous NER annotation decisions

Thanks, this is a good question! The thing with NER (and most NLP applications actually) is that there’s no “objective truth”. It all depends on your application and the results you want to produce.

spaCy’s English models use the OntoNotes 5 scheme for NER annotations, so if you were following that scheme, “hours” on its own would probably not be considered a TIME entity and you would reject the example. In cases like this, rejecting is actually better than ignoring, because you’re explicitly telling Prodigy “no, this is wrong, try again”. There are only so many possible analyses of the entities and their boundaries, and by explicitly rejecting wrong boundaries, you’re moving the model closer to the correct ones.
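To make the reject-versus-ignore point a bit more concrete, here is a toy sketch in plain Python. It is not how Prodigy works internally; the sentence, the candidate spans and the `reject` / `ignore` helpers are all made up for illustration. The idea is only that a reject rules one analysis out, while an ignore leaves everything as uncertain as before.

```python
# Toy illustration only: this is NOT Prodigy's implementation.
# The text, candidate spans and helper functions are hypothetical.
text = "The meeting lasted three hours"


def span(phrase, label):
    """Build a (start, end, label) tuple for a phrase found in the text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# A few possible analyses of where a TIME entity could be.
candidates = {
    span("hours", "TIME"),
    span("three hours", "TIME"),
    span("lasted three hours", "TIME"),
}


def reject(candidates, rejected):
    """An explicit 'no': this analysis is ruled out."""
    return candidates - {rejected}


def ignore(candidates, skipped):
    """Skipping gives no signal: every analysis stays on the table."""
    return set(candidates)


after_reject = reject(candidates, span("hours", "TIME"))
after_ignore = ignore(candidates, span("hours", "TIME"))

print(len(after_reject), "candidates left after rejecting")
print(len(after_ignore), "candidates left after ignoring")
```

With only a handful of plausible boundaries per entity, each reject removes a meaningful share of the wrong answers.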

However, it ultimately comes down to this: How do you want your application to perform? If you need to extract times and dates in a lot of different formats and then analyse and parse them, you probably want the model to only learn the exact spans. “hours” on its own is pretty useless. But if you mostly care about whether a text is about hours as opposed to minutes or seconds, regardless of the exact time span, teaching your model to ignore the numbers could also make sense.
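As a rough sketch of how those two goals lead to opposite decisions on the same suggestion, the snippet below encodes each one as a simple rule. The policy names, the `decide` helper and the small list of number words are assumptions made for this example, not part of Prodigy or spaCy, and a real annotation guideline would need to be more careful than a regular expression.

```python
import re

# Hypothetical heuristic: does the suggested span contain a quantity?
QUANTITY = re.compile(
    r"\d|\b(?:one|two|three|four|five|six|seven|eight|nine|ten|half|few)\b",
    re.IGNORECASE,
)


def decide(span_text, policy):
    """Return 'accept' or 'reject' for a suggested TIME span.

    policy="exact_spans": the application parses durations, so a span
                          without a quantity is not useful.
    policy="unit_only":   the application only cares about the unit,
                          so the numbers should be left out.
    """
    has_quantity = bool(QUANTITY.search(span_text))
    if policy == "exact_spans":
        return "accept" if has_quantity else "reject"
    if policy == "unit_only":
        return "reject" if has_quantity else "accept"
    raise ValueError(f"unknown policy: {policy}")


for suggestion in ["hours", "three hours"]:
    for policy in ["exact_spans", "unit_only"]:
        print(f"{suggestion!r} under {policy}: {decide(suggestion, policy)}")
```

Whichever rule you pick, the important part is applying it consistently across the whole dataset.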

Similarly, what your application considers an ORG or a PRODUCT doesn’t always need to match the underlying annotation scheme. I actually often find the pre-defined categories and definitions pretty unsatisfying for modern text (for example, is “YouTube” a PRODUCT? A WORK_OF_ART? Maybe it needs its own category PLATFORM?).
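One lightweight way to experiment with this before committing to a custom label is to remap the scheme's labels at the application level. The sketch below is plain Python with a made-up override table and a made-up `to_app_label` helper; PLATFORM is the hypothetical category from the example above, not a label in OntoNotes 5.

```python
# Hypothetical application-level overrides, keyed by lowercased entity text.
APP_LABEL_OVERRIDES = {
    "youtube": "PLATFORM",
}


def to_app_label(entity_text, scheme_label):
    """Use the application's own label if one is defined, else keep the scheme's."""
    return APP_LABEL_OVERRIDES.get(entity_text.lower(), scheme_label)


print(to_app_label("YouTube", "PRODUCT"))
print(to_app_label("Apple", "ORG"))
```

If the override table keeps growing, that is usually a sign the category deserves to be annotated and trained as its own label.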

So when you come across an ambiguous example like this, a better way to think about it would be to ask yourself: “If my model produced this result, would I be happy about it and would it benefit the rest of my application?” The fact that your corpus is not perfectly standardised is actually a good thing – especially if your application is supposed to handle unpredictable text like user input. It’s also where a custom NER model is most powerful.
