I am annotating medical articles (https://github.com/chopeen/CORD-19/blob/master/data/raw/cord_19_rf_sentences.jsonl). The RISK_FACTOR names I am highlighting are sometimes compound phrases that contain multiple entities in a single span of text:
- "chromosomal and other anomalies"
- "substandard housing and living conditions"
- "previous use of carbapenems and quinolones"
- "originating from high ( 20 % ) or medium ( 18 % ) endemic area"
Ideally, they should translate to the following entities:
- "chromosomal anomalies"
- "substandard housing conditions" + "substandard living conditions"
- "previous use of carbapenems" + "previous use of quinolones"
- "originating from high endemic area" + "originating from medium endemic area"
I think I would need a feature to highlight overlapping entities, which sometimes span non-consecutive words.
That's not possible in Prodigy, right?
What's the best practice?
Should I highlight only the first entity, or the entire compound phrase?
I think you've highlighted exactly the issue here. I actually jotted down some thoughts on this on Twitter the other day: https://twitter.com/honnibal/status/1247820919013335040
Basically you're kind of at the edge of where span-based approaches are a worthwhile trade-off. Once you get too many syntactic effects, things stop being flat spans and the NER machinery's assumptions get in the way a bit more.
What you could try is a more tree-based approach, for instance by highlighting the head nouns and having rules to expand out to the dependent words.
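To make the idea concrete, here's a minimal sketch of "annotate the head noun, expand by rule". The dependency structure is hand-specified here for clarity (in practice it would come from spaCy's parser, and the exact labels depend on the model); the `Tok` class and the `amod`/`compound` rules are illustrative assumptions, not a finished recipe.

```python
# Sketch: expand an annotated head noun into one flat phrase per
# coordinated modifier. The parse is hand-built here; in a real
# pipeline these attributes map onto spaCy's Token API.

from dataclasses import dataclass, field

@dataclass
class Tok:
    text: str
    dep: str                     # dependency label relative to its head
    children: list = field(default_factory=list)
    conjuncts: list = field(default_factory=list)

def expand(head):
    """Yield one flat phrase per coordinated modifier of the head noun."""
    # modifiers shared by all conjuncts (e.g. "substandard")
    shared = [c.text for c in head.children if c.dep == "amod"]
    # find a coordinated modifier chain among the head's children
    coord = next((c for c in head.children if c.conjuncts), None)
    if coord is None:
        yield " ".join(shared + [head.text])
        return
    for mod in [coord] + coord.conjuncts:
        yield " ".join(shared + [mod.text, head.text])

# "substandard housing and living conditions"
living = Tok("living", "conj")
housing = Tok("housing", "compound", conjuncts=[living])
conditions = Tok("conditions", "ROOT",
                 children=[Tok("substandard", "amod"), housing])

print(list(expand(conditions)))
# → ['substandard housing conditions', 'substandard living conditions']
```

The point of annotating only the head noun is that the annotation stays a simple, flat span, while the messy coordination logic lives in rules you can refine without re-annotating.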
The biggest problem I see is that an "entity" in your context isn't just one sort of thing, linguistically. The phrase "originating from high endemic area" is an attributive clause, while "chromosomal anomalies" is a noun phrase. It'll be a lot easier for the model and for the annotation if you can sort things out so that the targets aren't so structurally diverse. This usually works well for the downstream logic too, because once you've extracted the entities, it'll be hard to do anything with them if they lack structural consistency.
Edit: I just got done typing "maybe try annotating sentences based on whether they contain a risk factor", but I see you've done exactly that!
If you have the risk factor sentences, I wonder whether topic modelling would help? For instance, you could use Gensim and use LDA to do unsupervised topic modelling. This would give you soft clustering, and I'm guessing many of the clusters will correspond to risk factors.
From my simplistic perspective, NER seemed to be a solved problem - you just label enough data, train a model, and voilà, your custom NER is ready!
But working with the CORD-19 dataset (related discussion and notebook) showed it's more complex than that, and your explanation confirms it.
Thank you for linking http://markneumann.xyz/blog/numeric_annotation/ - very insightful!