I am annotating medical articles (https://github.com/chopeen/CORD-19/blob/master/data/raw/cord_19_rf_sentences.jsonl). The RISK_FACTOR names I am highlighting are sometimes compound phrases that contain multiple entities in a single span of text:
- "chromosomal and other anomalies"
- "substandard housing and living conditions"
- "previous use of carbapenems and quinolones"
- "originating from high ( 20 % ) or medium ( 18 % ) endemic area"
Ideally, they should translate to the following entities:
- "chromosomal anomalies"
- "substandard housing conditions" + "substandard living conditions"
- "previous use of carbapenems" + "previous use of quinolones"
- "originating from high endemic area" + "originating from medium endemic area"
I think I would need a feature to highlight overlapping entities, which sometimes span non-consecutive words.
That's not possible in Prodigy, right?
What's the best practice?
Should I highlight only the first entity, or the entire compound phrase?
I think you've highlighted exactly the issue here. I actually jotted down some thoughts on this on Twitter the other day: https://twitter.com/honnibal/status/1247820919013335040
Basically you're kind of at the edge of where span-based approaches are a worthwhile trade-off. Once you get too many syntactic effects, things stop being flat spans and the NER machinery's assumptions get in the way a bit more.
What you could try is a more tree-based approach, for instance by highlighting the head nouns and having rules to expand out to the dependent words.
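To make the idea concrete, here's a minimal sketch of "annotate the head noun, expand by rule". The dependency structure is hand-specified here for clarity (in practice it would come from spaCy's parser, and the exact labels depend on the model); the `Tok` class and the `amod`/`compound` rules are illustrative assumptions, not a finished recipe.

```python
# Sketch: expand an annotated head noun into one flat phrase per
# coordinated modifier. The parse is hand-built here; in a real
# pipeline these attributes map onto spaCy's Token API.

from dataclasses import dataclass, field

@dataclass
class Tok:
    text: str
    dep: str                     # dependency label relative to its head
    children: list = field(default_factory=list)
    conjuncts: list = field(default_factory=list)

def expand(head):
    """Yield one flat phrase per coordinated modifier of the head noun."""
    # modifiers shared by all conjuncts (e.g. "substandard")
    shared = [c.text for c in head.children if c.dep == "amod"]
    # find a coordinated modifier chain among the head's children
    coord = next((c for c in head.children if c.conjuncts), None)
    if coord is None:
        yield " ".join(shared + [head.text])
        return
    for mod in [coord] + coord.conjuncts:
        yield " ".join(shared + [mod.text, head.text])

# "substandard housing and living conditions"
living = Tok("living", "conj")
housing = Tok("housing", "compound", conjuncts=[living])
conditions = Tok("conditions", "ROOT",
                 children=[Tok("substandard", "amod"), housing])

print(list(expand(conditions)))
# → ['substandard housing conditions', 'substandard living conditions']
```

The point of annotating only the head noun is that the annotation stays a simple, flat span, while the messy coordination logic lives in rules you can refine without re-annotating.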
The biggest problem I see is that an "entity" in your context isn't just one sort of thing, linguistically. The phrase "originating from high endemic area" is an attributive clause, while "chromosomal anomalies" is a noun phrase. It'll be a lot easier for the model and for the annotation if you can sort things out so that the targets aren't so structurally diverse. This usually works well for the downstream logic too, because once you've extracted the entities, it'll be hard to do anything with them if they lack structural consistency.
Edit: I just got done typing "maybe try annotating sentences based on whether they contain a risk factor", but I see you've done exactly that!
If you have the risk factor sentences, I wonder whether topic modelling would help? For instance, you could use Gensim and use LDA to do unsupervised topic modelling. This would give you soft clustering, and I'm guessing many of the clusters will correspond to risk factors.
From my simplistic perspective, NER seemed to be a solved problem - you just label enough data, train a model, and voilà, your custom NER is ready!
But working with the CORD-19 dataset (related discussion and notebook) showed it's more complex than that, and your explanation confirms it.
Thank you for linking http://markneumann.xyz/blog/numeric_annotation/ - very insightful!