What is the optimal number of entity labels?

Hi there.

I’m starting to experiment with Prodigy. First of all, I want to say that it is an outstanding piece of software, and it’s hard to believe that it comes from only two people. Great work!

Are there any general guidelines for how fine-grained the entity labels should be?

Is it better to have very generic labels that correspond to many types of entities (up to the extreme case where entities are simply tagged “entity” as in the core model of https://allenai.github.io/scispacy/)? Or is it better to have finer-grained labels that maybe correspond to entities that will appear in similar syntactic constructions? If so, what is an upper limit to how many different labels one should include?

And as a small side-question: is there anything that limits entities to concepts (as is the case in most spaCy models I’ve seen)? Could I, for instance, define a “relation indicator” entity that matches things like:

“Vitamin D deficiency is strongly associated with fatigue.”

or:

“There are clear signs that narcolepsy can be caused by dysfunctions in GABA receptors”

Cheers!

Luca

Thanks for the kind words!

It’s hard to give very general-purpose guidance on how fine-grained to make the entities. Here’s one way to reason about it. Instead of predicting the fine-grained entity label directly, you could have a cascaded system that predicts a coarse label first and then separately predicts the subtype of each entity. In other words, predicting the fine-grained type is sort of like a joint model, P(x, y), while predicting the subtype separately is more like P(x) * P(y|x).
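To make the factorisation concrete, here is a minimal sketch with invented numbers. The label names and probabilities are purely illustrative and don't come from any real model. It only shows that a coarse-type distribution P(x) combined with a subtype distribution P(y|x) defines a distribution over the fine-grained (x, y) pairs, which is what a single fine-grained label set would have to predict in one step.

```python
# Toy illustration only: label names and probabilities are invented.
# P(x): distribution over coarse entity types.
p_coarse = {"CHEMICAL": 0.6, "DISEASE": 0.4}

# P(y | x): distribution over subtypes, given the coarse type.
p_subtype_given_coarse = {
    "CHEMICAL": {"VITAMIN": 0.25, "DRUG": 0.75},
    "DISEASE": {"SLEEP_DISORDER": 0.5, "DEFICIENCY": 0.5},
}


def joint_distribution(p_x, p_y_given_x):
    """Build P(x, y) = P(x) * P(y | x) for every (coarse, subtype) pair."""
    return {
        (x, y): p_x[x] * p_y
        for x, subtypes in p_y_given_x.items()
        for y, p_y in subtypes.items()
    }


joint = joint_distribution(p_coarse, p_subtype_given_coarse)
for (coarse, subtype), prob in sorted(joint.items()):
    print(f"P({coarse}, {subtype}) = {prob:.2f}")
```

In the cascaded setup, the entity recogniser would only need the two coarse labels, and a separate step would pick the subtype. In the joint setup, the recogniser would need all four fine-grained labels at once.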

Thinking of it this way, we can ask the much more general question: “When is a joint model a good idea?” A joint model is generally a good idea when you need to predict both x and y and there’s high mutual information between the classes that’s not captured by other features. Said more simply: when you have problems where knowing one class variable would help a lot in predicting another class variable, and vice versa, it can be helpful to predict them at the same time.
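If you want a rough feel for this on your own annotations, one option is to estimate the mutual information between two label variables from co-occurrence counts. The sketch below is a simplification: it uses invented toy data and ignores the "not captured by other features" part, so treat it as a sanity check rather than a decision rule.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Invented toy data: in the first list y is fully determined by x,
# in the second list x tells you nothing about y.
dependent = [("A", "a")] * 50 + [("B", "b")] * 50
independent = [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")] * 25

print(mutual_information(dependent))
print(mutual_information(independent))
```

The first list gives 1 bit (knowing x removes all uncertainty about y), the second gives 0 bits. Real label pairs will usually land somewhere in between.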

Ultimately the answers will be empirical, but I think you’ll find that extra entity typing probably doesn’t improve the accuracy of identifying entity spans very much. For very fine-grained entity types, you’re almost certainly better off resolving the mention against a knowledge base, and using that to tag the types.
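As a rough sketch of the knowledge base idea: the recogniser only finds the spans, and the fine-grained type is looked up afterwards. Everything below is hypothetical, including the `TOY_KB` entries, the identifiers and the type names. A real system would need proper entity linking rather than an exact string match.

```python
# Hypothetical toy knowledge base: keys are normalised mention strings,
# values are (identifier, fine-grained type). All entries are invented.
TOY_KB = {
    "vitamin d": ("KB:0001", "VITAMIN"),
    "narcolepsy": ("KB:0002", "SLEEP_DISORDER"),
    "gaba receptor": ("KB:0003", "RECEPTOR"),
}


def normalise(mention):
    """Lowercase the mention and collapse runs of whitespace."""
    return " ".join(mention.lower().split())


def fine_grained_type(mention, kb=TOY_KB):
    """Return the fine-grained type for a mention, or None if unresolved."""
    key = normalise(mention)
    # Very crude plural handling: retry with trailing "s" characters removed.
    entry = kb.get(key) or kb.get(key.rstrip("s"))
    return entry[1] if entry else None


for mention in ["Vitamin D", "GABA receptors", "fatigue"]:
    print(mention, "->", fine_grained_type(mention))
```

The advantage of this design is that adding a new fine-grained type means adding knowledge base entries, not re-annotating text and retraining the recogniser.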

Finally, to answer your other question: you could use the entity recogniser in this way, but it’s a bit difficult to get right. The problem is that the boundaries of your phrases are going to be a bit vague, and your annotators will struggle to follow a consistent policy. You might be better off annotating specific trigger words, instead of the phrases. Words will also make it easier to leverage the dependency parse to find relations. I have a new component planned for spaCy to fulfill this type of use-case. I discuss this here: https://twitter.com/honnibal/status/1111990886483853312
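To illustrate why single trigger words are easier to work with, here is a small sketch for the first example sentence. It does not call spaCy: the dependency labels and heads are written out by hand in a spaCy-like style, so they are an assumption about what a parser might produce, and the `Token` class and helper names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    i: int      # position in the sentence
    text: str
    dep: str    # dependency label (hand-written, spaCy-like)
    head: int   # index of the head token; the root points at itself


# Hand-written toy parse, NOT real parser output.
TOKENS = [
    Token(0, "Vitamin", "compound", 1),
    Token(1, "D", "compound", 2),
    Token(2, "deficiency", "nsubjpass", 5),
    Token(3, "is", "auxpass", 5),
    Token(4, "strongly", "advmod", 5),
    Token(5, "associated", "ROOT", 5),
    Token(6, "with", "prep", 5),
    Token(7, "fatigue", "pobj", 6),
    Token(8, ".", "punct", 5),
]


def children(tokens, head_index, dep=None):
    """Direct dependents of a token, optionally filtered by label."""
    return [
        t for t in tokens
        if t.head == head_index and t.i != head_index
        and (dep is None or t.dep == dep)
    ]


def subtree_text(tokens, root_index):
    """Text of a token plus all of its descendants, in sentence order."""
    indices, stack = [], [root_index]
    while stack:
        i = stack.pop()
        indices.append(i)
        stack.extend(child.i for child in children(tokens, i))
    return " ".join(tokens[i].text for i in sorted(indices))


def relation_arguments(tokens, trigger):
    """Find the subject and prepositional object phrases of a trigger word."""
    trig = next(t for t in tokens if t.text.lower() == trigger)
    subjects = children(tokens, trig.i, dep="nsubjpass")
    objects = [
        obj
        for prep in children(tokens, trig.i, dep="prep")
        for obj in children(tokens, prep.i, dep="pobj")
    ]
    return (
        [subtree_text(tokens, s.i) for s in subjects],
        [subtree_text(tokens, o.i) for o in objects],
    )


print(relation_arguments(TOKENS, "associated"))
```

Because the annotation is a single token ("associated"), there is no question about where the phrase starts or ends, and the arguments fall out of the tree. With a real parse the labels may differ, so the patterns would need adjusting.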

Thanks for the explanation! And great to hear that you’re working on a component that will make that specific task easier :)