Annotating compound entity phrases

I think you've highlighted exactly the issue here. I actually jotted down some thoughts on this on Twitter the other day: https://twitter.com/honnibal/status/1247820919013335040

Basically you're kind of at the edge of where span-based approaches are a worthwhile trade-off. Once you get too many syntactic effects, things stop being flat spans and the NER machinery's assumptions get in the way a bit more.

What you could try is a more tree-based approach, for instance by highlighting the head nouns and having rules to expand out to the dependent words.

The biggest problem I see is that an "entity" in your context isn't just one sort of thing, lingusitically. The phrase "originating from high endemic area" is an attributive clause, while "chromosomal anomalies" is a noun phrase. It'll be a lot easier for the model and for the annotation if you can sort things out so that things aren't so structurally diverse. This usually works well for the downstream logic too, because once you've extracted the entities, it'll be hard to do anything with them if they don't have structural consistency.

Edit: I just got done typing "maybe try annotating sentences based on whether they contain a risk factor", but I see you've done exactly that!

If you have the risk factor sentences, I wonder whether topic modelling would help? For instance, you could use Gensim and use LDA to do unsupervised topic modelling. This would give you soft clustering, and I'm guessing many of the clusters will correspond to risk factors.

1 Like