How to handle multiple concepts in the same phrase joined by a conjunction

We are new to Prodigy. We have an NER task to identify concepts in medical text. Given the phrase "the patient had difficulty in writing, reading, and speaking", we would like to tag 3 concepts: "difficulty writing", "difficulty reading", and "difficulty speaking". Is there a way in spaCy to convert the phrase to "the patient had difficulty writing, difficulty reading, and difficulty speaking", or is there a work-around in Prodigy? Failing all else, we would need to detect this rather common situation and parse the compound phrase into 3 smaller phrases. Any thoughts?

This is something we'd really like to have a spaCy component (or perhaps extension) for, but we don't currently have one. There are three main ways you could try to do this:

  1. Sequence-to-sequence. This is essentially like a machine translation task. The model would likely work, but it will need more training data, run slowly, and might sometimes return strange outputs. You could consider constraining the output in various ways, for instance by saying that it can only output vocabulary items that are in the original sentence. You'd want to implement the model with the transformers library, and only use Prodigy for the annotation. You'd probably want to use a text-box as the annotation interface. I've never done this myself, but I expect it should work.

  2. Statistical model over dependency parse. I wouldn't really recommend this, but instead of rules you could consider training a model to predict which nodes of the dependency parse are conjunctions of interest. I think this will be much more complicated, and share the same vulnerability to parse errors.

  3. Rules over the dependency parse. This should work, depending on the range of constructions you're working with. You won't have to annotate training data, only evaluation data. This approach can potentially be faster, too.
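To make approach 3 concrete, here's a minimal sketch of the idea. The token layout mimics spaCy's dependency output (text, dependency label, head index), but the parse below is hand-written for illustration rather than produced by a model, and the `expand_conjunction` helper is hypothetical, not part of spaCy:

```python
# Sketch: distribute a head noun ("difficulty") over coordinated gerunds
# by following prep -> pobj -> conj arcs in a dependency parse.
# The parse is hand-written to match the example sentence; labels are
# the ones spaCy's English models typically assign, but this is an
# illustration, not real model output.

def expand_conjunction(tokens, anchor):
    """Return phrases pairing `anchor` with each coordinated complement."""
    anchor_idx = next(i for i, t in enumerate(tokens) if t["text"] == anchor)
    # find prepositions governed by the anchor ("in")
    preps = [i for i, t in enumerate(tokens)
             if t["dep"] == "prep" and t["head"] == anchor_idx]
    # find the first complement ("writing")
    conjuncts = [i for i, t in enumerate(tokens)
                 if t["dep"] == "pobj" and t["head"] in preps]
    # follow conj arcs off the complements ("reading", "speaking")
    for i, t in enumerate(tokens):
        if t["dep"] == "conj" and t["head"] in conjuncts:
            conjuncts.append(i)
    return [f"{anchor} {tokens[i]['text']}" for i in conjuncts]

parse = [
    {"text": "the",        "dep": "det",   "head": 1},
    {"text": "patient",    "dep": "nsubj", "head": 2},
    {"text": "had",        "dep": "ROOT",  "head": 2},
    {"text": "difficulty", "dep": "dobj",  "head": 2},
    {"text": "in",         "dep": "prep",  "head": 3},
    {"text": "writing",    "dep": "pobj",  "head": 4},
    {"text": "reading",    "dep": "conj",  "head": 5},
    {"text": "and",        "dep": "cc",    "head": 5},
    {"text": "speaking",   "dep": "conj",  "head": 5},
]
print(expand_conjunction(parse, "difficulty"))
# -> ['difficulty writing', 'difficulty reading', 'difficulty speaking']
```

With real spaCy objects you'd use `token.dep_`, `token.head`, and `token.conjuncts` instead of dicts, but the traversal logic is the same — and, as discussed below, it only works when the parse itself is right.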

One problem will be if the dependency parse is wrong. Accuracy over conjunctions is lower than accuracy over other types of constructions, especially for coordinating commas. For instance, have a look at this parse of the example sentence by the v2.3 small English model:

displaCy Dependency Visualizer · Explosion

This parse is incorrect: the coordination is attached to "difficulty", which is the structure you'd expect if the person had said "the patient had reading, speaking and difficulty in writing". The correct parse would attach the coordination to "writing" instead.

You might find that there are only a few error types the parser makes, so your rules can correct for them. But then the rules will be specific to the version of the parser you engineer them for. You might find instead that the transformer-based model parses much more accurately.

Overall I would suggest trying approach 3. Possibly you can even find some existing work that has done this. The Holmes information extraction project says it does this type of conjunction resolution, but I had a look at the code and it seems there's no simple function we could extract that would do it.

Thank you for the careful consideration of our problem. We suspected that phrases linked by conjunctions would be a problem. MetaMap (the rule-based system from the NIH) also has trouble with conjunctions, even though it has an option called "conjunction processing"; it cannot resolve the example sentence I gave you.

Until we find a better solution, we will probably try some rule-based methods to expand phrases with conjunctions and do our annotating on post-processed text rather than native text (after distributing the modifier over all the terms linked by the conjunction). That is, we will annotate "difficulty in writing", "difficulty in reading", and "difficulty in speaking" as three distinct concepts rather than try to train on the entire compound phrase as one.

Our plan is to annotate only one NER class, which we call a concept, corresponding to concepts in the UMLS Metathesaurus of the NIH in the USA. We have a subset target ontology called the neuro-ontology, based on the UMLS; it has only 1400 named entities. Thanks again,
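For what it's worth, that pre-processing step could start from something as simple as a surface pattern, before any parsing is involved. This is only a sketch for the exact "<head> in A, B, and C" shape of the example sentence — the regex and the `distribute` helper are assumptions, not a general solution:

```python
# Sketch: rewrite "X in A, B, and C" as "X in A, X in B, X in C"
# at the surface level, with no dependency parse. Covers only the
# simple comma/"and" coordination in the example; real clinical text
# will need the parse-based approach for anything more complex.
import re

PATTERN = re.compile(r"(?P<head>\w+ in) (?P<items>\w+(?:, \w+)*,? and \w+)")

def distribute(text):
    def repl(match):
        head = match.group("head")
        items = re.split(r",? and |, ", match.group("items"))
        return ", ".join(f"{head} {item}" for item in items)
    return PATTERN.sub(repl, text)

print(distribute("the patient had difficulty in writing, reading, and speaking"))
# -> the patient had difficulty in writing, difficulty in reading, difficulty in speaking
```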

Daniel Hier, a retired neurologist and amateur data engineer