Annotating compound entity phrases

honnibal · April 15, 2020, 10:19am

I think you've highlighted exactly the issue here. I actually jotted down some thoughts on this on Twitter the other day: https://twitter.com/honnibal/status/1247820919013335040

Basically you're kind of at the edge of where span-based approaches are a worthwhile trade-off. Once you get too many syntactic effects, things stop being flat spans and the NER machinery's assumptions get in the way a bit more.

What you could try is a more tree-based approach, for instance by highlighting the head nouns and having rules to expand out to the dependent words.
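To make the head-noun idea a bit more concrete, here's a rough sketch in plain Python. The parse is hand-written and hypothetical (the `words` and `heads` lists and the `subtree_span` and `expand` helpers are made up for illustration). In practice you'd take the heads from a dependency parser. In spaCy, for example, `token.subtree` gives you the same set of dependents. It also assumes the tree is projective, so each subtree covers a contiguous run of tokens.

```python
# Hypothetical, hand-written dependency parse: heads[i] is the index of
# token i's syntactic head, and the root points to itself.
words = ["Patient", "originating", "from", "high", "endemic", "area",
         "with", "chromosomal", "anomalies"]
heads = [0, 0, 1, 5, 5, 2, 0, 8, 6]


def subtree_span(heads, head_index):
    """Return (start, end) covering head_index and all of its descendants."""
    # Build a head -> children lookup, skipping the root's self-loop.
    children = {}
    for i, h in enumerate(heads):
        if i != h:
            children.setdefault(h, []).append(i)
    # Walk down from the highlighted head and collect every dependent.
    stack, covered = [head_index], []
    while stack:
        node = stack.pop()
        covered.append(node)
        stack.extend(children.get(node, []))
    return min(covered), max(covered) + 1


def expand(words, heads, head_index):
    """Expand a single highlighted head word to the text of its full phrase."""
    start, end = subtree_span(heads, head_index)
    return " ".join(words[start:end])


for head_index in (5, 8, 1):
    print(words[head_index], "->", expand(words, heads, head_index))
```

The annotator only has to click one word, and which word they pick decides how much of the phrase comes along: highlighting the noun gives the noun phrase, highlighting the participle gives the whole modifier.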

The biggest problem I see is that an "entity" in your context isn't just one sort of thing, linguistically. The phrase "originating from high endemic area" is an attributive clause, while "chromosomal anomalies" is a noun phrase. It'll be a lot easier for the model and for the annotation if you can sort things out so that the entities aren't so structurally diverse. This usually works well for the downstream logic too, because once you've extracted the entities, it'll be hard to do anything with them if they don't have structural consistency.

Edit: I just got done typing "maybe try annotating sentences based on whether they contain a risk factor", but I see you've done exactly that!

If you have the risk factor sentences, I wonder whether topic modelling would help? For instance, you could use Gensim's LDA implementation to do unsupervised topic modelling. This would give you a soft clustering of the sentences, and I'm guessing many of the clusters will correspond to risk factors.
