Preferred labels


There are quite a few words that can take several labels, even in the same context. My favorite example:

Silicon Valley: LOC (San Francisco, Palo Alto, San Jose, etc.)
Silicon Valley: ORG (all tech companies in that area)

How do/did you label “Silicon Valley”? Are there more words/concepts like this?


PS: I label “Silicon Valley” as LOCation

Hi! The thing is, there’s not really a “correct” answer here – it really depends on how you need your application to perform. spaCy’s pre-trained English models use the OntoNotes 5 annotation scheme for named entities. By that definition, “Silicon Valley” would pretty much always be considered a LOC. So if you’re updating an existing model with more annotations, it’s usually better to stick to the existing label scheme instead of trying to teach it a completely new definition.
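If you do stick with the existing scheme, a quick sanity check is to confirm that your annotations only use labels the scheme defines. Here's a minimal sketch in plain Python: the label set is the 18 OntoNotes 5 entity types, while the `unknown_labels` helper, the `(text, label)` annotation format and the `TECH_HUB` label are made up for illustration and aren't part of spaCy or Prodigy.

```python
# The 18 named entity labels defined by the OntoNotes 5 scheme
ONTONOTES_LABELS = {
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
    "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
}

def unknown_labels(annotations, scheme=ONTONOTES_LABELS):
    """Return the labels used in the annotations that the scheme doesn't define."""
    used = {label for _text, label in annotations}
    return sorted(used - scheme)

# Hypothetical annotations as (text, label) pairs
annotations = [
    ("Silicon Valley", "LOC"),
    ("Palo Alto", "GPE"),
    ("Silicon Valley", "TECH_HUB"),  # invented label, not in the scheme
]

print(unknown_labels(annotations))
```

Any label this reports would be a new definition the existing model has never seen, so it's worth deciding deliberately whether you really need it.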

What’s most important is that your label scheme is internally consistent, especially if you’re training a new model from scratch. This is btw also a good use case for Prodigy – developing a good and consistent scheme takes time and usually several iterations. So you probably want to try out different ideas, and once you’re actually annotating the data, you’ll probably come across examples that require changes to the label scheme, and so on.
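One way to spot inconsistencies while you iterate is to look for strings that ended up with more than one label. This is only a rough sketch: `find_conflicts` and the `(text, label)` pairs are hypothetical, and you'd need to adapt it to however your annotations are actually stored.

```python
from collections import defaultdict

def find_conflicts(annotations):
    """Map each annotated text (lowercased) to its labels if it has more than one."""
    labels_by_text = defaultdict(set)
    for text, label in annotations:
        labels_by_text[text.lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

# Hypothetical annotations as (text, label) pairs
annotations = [
    ("Silicon Valley", "LOC"),
    ("silicon valley", "ORG"),
    ("Palo Alto", "GPE"),
]

conflicts = find_conflicts(annotations)
print(conflicts)
```

A conflict isn't necessarily a mistake, since some strings really do mean different things in different sentences. It's more of a prompt to go back and check whether the label scheme needs a clearer rule for that case.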

On a semi-related note, this talk by @honnibal explains some of the motivation behind the iterative philosophy: