Annotating a single-word vs multi-token phrase with a label: How to decide?

Hello, I am new to Prodigy and am currently using it to recognize a new custom entity type. I was wondering how one should decide whether to annotate a multi-token phrase as an entity.

In your Reddit INGRED label example, you chose to annotate "onions" as well as "green onions" with the INGRED label. Was the latter necessary when the token "onions" by itself is enough to be marked as an entity of type INGRED? Are there any advantages to annotating "green onions" as an entity as well?

Hi @santoshbs ,

This is a difficult question that will come down to your data, what your application needs at the end of it, and how easily you can use rules to adjust between different annotation policies.

A good principle to keep in mind when thinking about language annotations is compositionality. The phrase "green onions" actually means a bit more than the sum of its parts. It doesn't just mean onions that are coloured green, the way the phrase "green apples" means apples that are green. Rather, green onions are what are also known as scallions or spring onions.

If you have modifiers that behave entirely compositionally, you probably want to leave those out of an entity annotation. One reason is that you can generally swap in a whole constituent instead of just a word if you're doing normal syntactic composition. For instance, consider the phrase "very light yellow, almost translucent". This can be swapped in most places you could use an adjective, so you can have "very light yellow, almost translucent onions".

If your policy had been to annotate colours as part of the entity, suddenly you're stuck annotating this huge phrase as an entity. But non-compositional phrases won't work like this, generally. They tend to be fixed phrases, because otherwise nobody would be able to learn what they mean. You can call scallions "green onions", but you can't call them "light jade onions", even though "light jade" means roughly "green".
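To make the "use rules to adjust between annotation policies" idea concrete, here is a minimal sketch in plain Python. It is not part of Prodigy or spaCy: the function name `trim_entity` and both word lists are hypothetical and would need to be built from your own data. It takes a token list and an entity span, drops leading modifiers that compose normally, and keeps fixed phrases such as "green onions" whole.

```python
# Hypothetical word lists for illustration only; build your own from your data.
FIXED_PHRASES = {("green", "onions"), ("spring", "onions")}
COMPOSITIONAL_MODIFIERS = {"very", "light", "green", "yellow", "chopped"}


def trim_entity(tokens, start, end):
    """Return (start, end) token offsets with leading compositional
    modifiers removed. `end` is exclusive. Fixed phrases are kept whole."""
    while end - start > 1:
        span = tuple(t.lower() for t in tokens[start:end])
        if span in FIXED_PHRASES:
            break  # non-compositional phrase: keep it as one entity
        if span[0] not in COMPOSITIONAL_MODIFIERS:
            break  # first token is not a known modifier: stop trimming
        start += 1  # drop the leading modifier and check again
    return start, end


print(trim_entity(["chopped", "green", "onions"], 0, 3))         # -> (1, 3)
print(trim_entity(["very", "light", "yellow", "onions"], 0, 4))  # -> (3, 4)
print(trim_entity(["green", "apples"], 0, 2))                    # -> (1, 2)
```

A rule like this lets you annotate under one policy and convert to the other afterwards, so the choice you make while annotating is less of a one-way door.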


Many thanks, @honnibal, for pointing me to the notion of compositionality. This is a really good principle for me and my collaborators to keep in mind as we annotate for a new entity.

By any chance, does Prodigy have a compilation of such guidelines for annotating entities? While you have made several extremely useful videos and documentation for using the Prodigy software and pipelines, some guidelines/tips for manual annotation or even pointers to some external resources that provide such guidelines would be of great help to novices like me. But I understand I might be asking for too much from the Prodigy team. Thanks, again!

We do want to write up something more comprehensive that covers these topics, as I think they're not discussed clearly enough at the moment. We don't have that yet, though.