Domain-specific NER project

Hi everyone, it's great to be part of the Prodigy community! I only started learning spaCy a few weeks ago, and tried Prodigy for the first time yesterday. I hope to learn and grow with everyone’s opinions and advice :slight_smile:

I’m currently working on a domain-specific project that deals with entity recognition in the healthcare sector. As such, any models pre-trained on medical texts would be extremely useful and would speed up entity recognition (e.g. Cause of Illness, Treatment, Diagnosis, Duration, Impression, etc.).

  1. Are there any recommendations for open-source pre-trained models that might be useful? I have seen open-source options such as medaCy and PubMed. How about the models in spaCy’s universe (such as Kindred, saber, scispaCy)?

As mentioned, I will be working on NER extraction from text files consisting of doctors’ prescription/diagnostic notes. I am currently focusing on the entity “Causes of Illness”. For example, if someone has a fever, the cause of illness could be a throat inflammation or a bite from a disease-carrying bug, or there could simply be no known reason.

Here are my 3 proposals to approach this:
A) Thinking on a superficial level, I could just run terms.teach on known “Causes of Illness”, but without the en_core_web_lg model. Instead, I would use a model pre-trained on healthcare text to get better word embeddings and better similarity comparisons. Is ner.match useful here too?

B) I could possibly use pattern matching (though I'm not too sure how) to identify phrases that introduce “Causes of Illness”, e.g. “because”, “due to”, “as a result of”, etc. How can I then teach the model to focus more on the text following these phrases when tagging “Causes of Illness”?

C) Could I associate 2 labels with each other (e.g. the presence of the label “ILLNESS” might increase the likelihood of “Causes of Illness” being present)? Can the dependency recipes be used here (dep.teach, etc.)?

If anyone has any suggestions, please fire away… Thanks for taking the time to read this. Any advice is appreciated.

I'd definitely recommend scispaCy, yes!

There are some potential problems with actually trying to model CAUSE_OF_ILLNESS as a pure named entity recognition task. spaCy's named entity recognizer is very much optimised for predicting named entities, e.g. named objects and proper nouns. If you're trying to label spans that are more like half-sentences and longer phrases, it might not actually work as well. Another problem you may encounter is that causes of illness can have pretty fuzzy boundaries, so it can be quite difficult to annotate that category consistently. One annotator may include "as a result of", another annotator might start the span as "result of". This makes the distinction even harder to train.

One way to make the problem easier to model is to focus on the more general "concepts" or "things" and resolve the semantics later. For instance, you could label "bacterial infection" or "Salmonellosis" etc. (sorry, I'm not a bio NLP person and can't think of better examples right now :wink:). You can definitely use ner.match for this and load in a dictionary of some common terms here – or use spaCy's Matcher to pre-select them in your data and then use ner.manual to correct the result. The more you can automate and pre-highlight, the better :slightly_smiling_face:
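To make the pre-highlighting idea a bit more concrete, here is a minimal sketch, assuming spaCy v3 and a blank English pipeline. The term list, the CONCEPT label and the `pre_highlight` helper are made up for illustration; the output dict follows the `text` / `spans` layout that Prodigy tasks use, with character offsets.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank English pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")

# Hypothetical term list; in practice this would come from a domain dictionary.
TERMS = ["bacterial infection", "salmonellosis", "throat inflammation"]

# attr="LOWER" makes the matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CONCEPT", [nlp.make_doc(term) for term in TERMS])


def pre_highlight(text):
    """Return a task dict with character-offset spans for every matched term."""
    doc = nlp(text)
    spans = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]
        spans.append(
            {"start": span.start_char, "end": span.end_char, "label": "CONCEPT"}
        )
    # Sort by position so the output order does not depend on the matcher.
    spans.sort(key=lambda s: s["start"])
    return {"text": text, "spans": spans}


task = pre_highlight("Fever due to throat inflammation, possibly Salmonellosis.")
for s in task["spans"]:
    print(task["text"][s["start"]:s["end"]], s["label"])
```

Tasks like this could be written out as JSONL and corrected by hand in ner.manual, so the annotator only has to fix what the dictionary missed or got wrong.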

Once you've trained a model to predict those concepts, you can add rules that determine whether a given concept is a cause of illness. Your post already mentioned some trigger words like "because", "due to" etc. So for each entity, you can check the previous tokens and the words it's attached to, and if they're among your trigger words, expand the entity and mark it as a cause of illness. You can find some examples of this in the spaCy usage documentation on rule-based matching.
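As a rough sketch of that rule (not the only way to do it), the function below looks at the tokens directly before each entity and, if they spell out a trigger phrase, expands the entity to include the trigger and relabels it. It assumes spaCy v3. The trigger list, the CONCEPT and CAUSE_OF_ILLNESS labels and the `mark_causes` name are illustrative, and the entity is set by hand here to stand in for a trained model's prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Hypothetical trigger phrases, lowercased, one tuple of tokens per phrase.
TRIGGERS = [
    ("because", "of"),
    ("due", "to"),
    ("as", "a", "result", "of"),
    ("caused", "by"),
]


def mark_causes(doc):
    """Expand entities preceded by a trigger phrase and relabel them."""
    new_ents = []
    for ent in doc.ents:
        updated = ent
        for trigger in TRIGGERS:
            start = ent.start - len(trigger)
            if start < 0:
                continue  # not enough tokens before the entity
            preceding = tuple(t.lower_ for t in doc[start:ent.start])
            if preceding == trigger:
                # New span covering the trigger phrase plus the original entity.
                updated = Span(doc, start, ent.end, label="CAUSE_OF_ILLNESS")
                break
        new_ents.append(updated)
    # Note: doc.ents rejects overlapping spans, so this assumes the trigger
    # tokens are not already part of another entity.
    doc.ents = new_ents
    return doc


doc = nlp("Fever due to throat inflammation.")
# Stand-in for a model prediction: "throat inflammation" as a CONCEPT.
doc.ents = [Span(doc, 3, 5, label="CONCEPT")]
doc = mark_causes(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

In a real pipeline, a function like this could be registered as a custom component and run after the entity recognizer. If your annotation scheme keeps trigger words outside the span, you could relabel the entity without expanding it.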

If the combination of NER and rules I outlined above works for you, you could also build the label association from your option C into your rules and look for both trigger words and the presence of ILLNESS. The statistical model can potentially also learn to pick up on this if the entities are close enough to each other, since the predicted entity spans are influenced by a window of 4 tokens on either side.
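Here is one hedged way such a combined rule might look, again assuming spaCy v3. The trigger set, the labels, the `causes_with_illness` helper and the `WINDOW` size are all assumptions made for this sketch; the window is just a convenient cut-off for the rule and is separate from what the statistical model sees.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

TRIGGER_WORDS = {"because", "due", "from"}  # hypothetical single-token triggers
WINDOW = 4  # tokens to inspect on each side; an arbitrary choice for this sketch


def causes_with_illness(doc):
    """Return CONCEPT entities with a trigger word before them and an ILLNESS nearby."""
    illness_tokens = {
        token.i for ent in doc.ents if ent.label_ == "ILLNESS" for token in ent
    }
    results = []
    for ent in doc.ents:
        if ent.label_ != "CONCEPT":
            continue
        lo = max(0, ent.start - WINDOW)
        hi = min(len(doc), ent.end + WINDOW)
        has_trigger = any(t.lower_ in TRIGGER_WORDS for t in doc[lo:ent.start])
        has_illness = any(i in illness_tokens for i in range(lo, hi))
        if has_trigger and has_illness:
            results.append(ent)
    return results


doc = nlp("Patient has a fever due to throat inflammation.")
# Stand-ins for model predictions.
doc.ents = [Span(doc, 3, 4, label="ILLNESS"), Span(doc, 6, 8, label="CONCEPT")]
print([ent.text for ent in causes_with_illness(doc)])
```

Requiring both signals makes the rule stricter than the trigger-only check, so it trades some recall for precision. Whether that trade is worth it depends on how your notes are written.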

The dependency parser doesn't necessarily come into play here. However, if you're using the syntactic dependencies in your rules (e.g. to find the trigger words an entity is attached to), you might consider improving the parser on your specific data if its predictions aren't good enough.

Btw, you might also want to check out the "medical" tag on the forum: https://support.prodi.gy/tags/medical