Annotating a single-word vs multi-token phrase with a label: How to decide?

santoshbs · February 18, 2021, 11:29pm

Hello, I am new to Prodigy and am currently using it to recognize a new custom entity type. I was wondering how one should decide whether to annotate a multi-token phrase as an entity.

In your Reddit INGRED label example, you chose to annotate "onions" as well as "green onions" with the INGRED label. Was the latter necessary when the token "onions" by itself is enough to be delimited as an NER entity of type INGRED? Are there any advantages to annotating "green onions" as well in the text?

honnibal · February 19, 2021, 1:16pm

Hi @santoshbs ,

This is a difficult question that will come down to your data, what your application needs at the end of it, and how easily you can use rules to adjust between different annotation policies.

A good principle to keep in mind when thinking about language annotations is compositionality. The phrase "green onions" actually means a bit more than the sum of its parts. It doesn't just mean onions that are coloured green, like the phrase "green apples" does. Rather, green onions are what are also known as scallions or spring onions: https://www.google.com/search?q=green+onions&rlz=1C5CHFA_enAU930AU930&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiW3K7kgvbuAhWjyjgGHQWkADgQ_AUoAXoECBIQAw&biw=1444&bih=710

If you have modifiers that behave entirely compositionally, you probably want to leave those out of an entity annotation. One reason is that you can generally swap in a whole constituent instead of just a word if you're doing normal syntactic composition. For instance, consider the phrase "very light yellow, almost translucent". This can be swapped in most places you could use an adjective, so you can have "very light yellow, almost translucent onions".

If your policy had been to annotate colours as part of the entity, suddenly you're stuck annotating this huge phrase as an entity. But non-compositional phrases won't work like this, generally. They tend to be fixed phrases, because otherwise nobody would be able to learn what they mean. You can call scallions "green onions", but you can't call them "light jade onions", even though "light jade" means roughly "green".
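As a rough illustration of adjusting between annotation policies with a rule, here is a minimal sketch in plain Python. It assumes you annotated only the head noun and later want fixed phrases like "green onions" covered as a whole. The `FIXED_PHRASES` list, the `expand_to_fixed_phrases` helper and the `(start, end, label)` span format are all hypothetical names made up for this example, not part of Prodigy or spaCy.

```python
# Hypothetical list of fixed (non-compositional) phrases. In practice this
# would come from your own data or domain knowledge.
FIXED_PHRASES = {("green", "onions"), ("spring", "onions")}


def expand_to_fixed_phrases(tokens, spans, label="INGRED"):
    """Widen entity spans by one token to the left when that forms a fixed phrase.

    tokens: list of token strings
    spans: list of (start, end, label) tuples with an exclusive end index
    """
    adjusted = []
    for start, end, span_label in spans:
        if span_label == label and start > 0:
            candidate = tuple(t.lower() for t in tokens[start - 1:end])
            if candidate in FIXED_PHRASES:
                start -= 1  # pull the modifier into the entity
        adjusted.append((start, end, span_label))
    return adjusted


tokens = ["Add", "light", "yellow", "onions", "and", "green", "onions"]
head_only = [(3, 4, "INGRED"), (6, 7, "INGRED")]
print(expand_to_fixed_phrases(tokens, head_only))
```

The compositional modifier ("light yellow") stays outside the entity, while the fixed phrase is widened to cover "green onions". If a rule like this is easy to write for your data, the choice of annotation policy matters less, since you can convert between the two afterwards.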

santoshbs · February 19, 2021, 6:18pm

Many thanks, @honnibal, for pointing me to the notion of compositionality. This is a really good principle for me and my collaborators to keep in mind as we annotate a new entity type.

By any chance, does Prodigy have a compilation of such guidelines for annotating entities? You have made several extremely useful videos and a lot of documentation on using the Prodigy software and pipelines, but some guidelines or tips for manual annotation, or even pointers to external resources that provide such guidelines, would be of great help to novices like me. I understand I might be asking too much of the Prodigy team, though. Thanks again!

honnibal · February 20, 2021, 5:07am

We do want to write up something more comprehensive that covers these topics, as I think they aren't discussed clearly enough at the moment. We don't have that yet, though.
