What is the optimum amount of entity labels?

ldorigo · May 6, 2019, 6:02pm

Hi there.

I’m starting to experiment around with Prodigy. First of all, I want to say that it is an outstanding piece of software, and it’s hard to believe that it comes from only two people. Great work!

Are there any general guidelines for how fine-grained the entity labels should be?

Is it better to have very generic labels that correspond to many types of entities (up to the extreme case where entities are simply tagged “entity” as in the core model of https://allenai.github.io/scispacy/)? Or is it better to have finer-grained labels that maybe correspond to entities that will appear in similar syntactic constructions? If so, what is an upper limit to how many different labels one should include?

And as a small side-question: is there anything that limits entities to concepts (as is the case in most spaCy models I’ve seen)? Could I, for instance, define a “relation indicator” entity that matches things like:

“Vitamin D deficiency is strongly associated with fatigue.”

or:

“There are clear signs that narcolepsy can be caused by dysfunctions in GABA receptors”

Cheers!

Luca

honnibal · May 6, 2019, 8:47pm

Thanks for the kind words!

It’s hard to give very general-purpose guidance on how fine-grained to make the entities. Here’s one way to reason about it. Instead of predicting the fine-grained entity label directly, you could have a cascaded system that separately predicts the subtypes of your entities. In other words, the fine-grained type is sort of like a joint model, like P(x, y), while predicting the subtype separately is more like P(x) * P(y|x).

Thinking of it this way, we can ask the much more general question, “When is a joint model a good idea?”. A joint model is generally a good idea when you need to predict both x and y and there’s high mutual information between the classes that’s not captured by other features. Said more simply: when you have problems where knowing one class variable would help a lot in predicting another class variable, and vice versa, it can be helpful to predict them at the same time.
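As a rough illustration of the "high mutual information" criterion, here is a minimal sketch that estimates the mutual information between two label variables from paired observations. The label names and counts are made up for the example, and the function name `mutual_information` is an assumption, not part of spaCy or Prodigy.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: here y is fully determined by x.
dependent = [("ORG", "company")] * 50 + [("PERSON", "politician")] * 50

# Hypothetical toy data: here y tells you nothing about x.
independent = [
    (x, y) for x in ("ORG", "PERSON") for y in ("formal", "informal")
] * 25

print(mutual_information(dependent))
print(mutual_information(independent))
```

When the estimate is close to zero, knowing one variable barely helps with the other, which suggests little is gained by predicting them jointly. Note that this toy estimate ignores the "not captured by other features" part of the criterion, which in practice only an experiment can settle.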

Ultimately the answers will be empirical, but I think you’ll find that extra entity typing probably doesn’t improve the accuracy of identifying entity spans very much. For very fine-grained entity types, you’re almost certainly better off resolving the mention against a knowledge base, and using that to tag the types.
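The knowledge-base idea can be sketched roughly as follows. Everything here is hypothetical: the toy knowledge base, its identifiers, the type names and the exact-match lookup stand in for a real entity-linking step, which would need to handle aliases and ambiguity.

```python
# Hypothetical toy knowledge base: normalised mention -> (identifier, fine-grained types).
TOY_KB = {
    "vitamin d": ("KB:0001", ["NUTRIENT", "VITAMIN"]),
    "narcolepsy": ("KB:0002", ["DISEASE", "SLEEP_DISORDER"]),
    "gaba receptors": ("KB:0003", ["PROTEIN", "RECEPTOR"]),
}


def normalise(mention):
    """Lowercase the mention and collapse runs of whitespace."""
    return " ".join(mention.lower().split())


def tag_fine_grained(mentions, kb=TOY_KB):
    """Attach fine-grained types to mentions the NER model labelled coarsely."""
    results = []
    for mention in mentions:
        entry = kb.get(normalise(mention))
        kb_id, types = entry if entry else (None, [])
        results.append({"mention": mention, "kb_id": kb_id, "types": types})
    return results


# The mentions would come from a coarse entity recogniser; here they are typed in by hand.
for row in tag_fine_grained(["Vitamin D", "GABA  receptors", "fatigue"]):
    print(row)
```

With this split, the recogniser only has to find spans with a small label set, and the fine-grained types can be extended by editing the knowledge base rather than re-annotating text.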

Finally, to answer your other question: you could use the entity recogniser in this way, but it’s a bit difficult to get right. The problem is that the boundaries of your phrases are going to be a bit vague, and your annotators will struggle to follow a consistent policy. You might be better off annotating specific trigger words, instead of the phrases. Words will also make it easier to leverage the dependency parse to find relations. I have a new component planned for spaCy to fulfill this type of use-case. I discuss this here: https://twitter.com/honnibal/status/1111990886483853312
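To make the trigger-word suggestion concrete, here is a small sketch that marks single trigger words instead of whole phrases and records them as character-offset spans. The trigger list, the `RELATION_TRIGGER` label and the `make_task` helper are assumptions for illustration; the output uses a plain text-plus-spans dictionary layout rather than any particular library call.

```python
import re

# Hypothetical trigger words and label, chosen to match the two example sentences.
TRIGGERS = {"associated": "RELATION_TRIGGER", "caused": "RELATION_TRIGGER"}


def make_task(text, triggers=TRIGGERS):
    """Return the text together with one span per trigger word found in it."""
    spans = []
    for match in re.finditer(r"\w+", text):
        label = triggers.get(match.group().lower())
        if label:
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    return {"text": text, "spans": spans}


examples = [
    "Vitamin D deficiency is strongly associated with fatigue.",
    "There are clear signs that narcolepsy can be caused by dysfunctions in GABA receptors",
]
for example in examples:
    task = make_task(example)
    print([example[s["start"]:s["end"]] for s in task["spans"]])
```

Because each span is a single word, annotators do not have to decide where a phrase like "is strongly associated with" begins or ends, and each trigger maps onto one token in a dependency parse.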

ldorigo · May 8, 2019, 7:54pm

Thanks for the explanation! And great to hear that you’re working on a component that will make that specific task easier.
