Very interesting new project by @Andrey, built on top of spaCy and using Prodigy for custom domain-specific annotation
Med7 — an information extraction model for clinical natural language processing (built with spaCy & Prodigy)
We are about a year behind this project in building out models for suicide risk factor labeling of our clinical notes. But, we are following a similar path and have identified 7 custom entities to label in our privately held clinical notes.
In any event, @Andrey, I would love to hear more details about the choices you made and the lessons you learned on your path to creating Med7. The Medium article is well done, so of course it focuses on the big picture with few how-to details. I found the articles and GitHub repo inspirational, as I sometimes feel overwhelmed with so much to learn and do.
Speaking of needing help, I did post a note in the Consultants sticky note but it doesn't seem to get much traffic. Please let me know if you recall anyone you'd recommend for this kind of work.
Dear @sleclair0, apologies for my delayed reply. I am more than happy to share as many details as possible and help with your project. Actually, there should be (very soon) a new paper from our NLP group on using free-text electronic health records for suicide risk prediction. Please drop me an email (you can find it in our med7 pre-print, bottom of the first page).
Hey Andrey, thanks for releasing your model and sharing your paper. I have a question regarding the pre-trained language model. Did you use the spacy pretrain module for this, or did you train it yourself using your own code?
I did use the spacy pretrain CLI on the entire MIMIC corpus and experimented with various parameters. You can find my pretrained models on GitHub: https://github.com/kormilitzin/med7
Let me know should you have any further questions!
I missed that at the start of your README, apologies for the obvious question! I'm working on a similar problem and Med7 has been a really great inspiration.
I've been experimenting with spacy pretrain and have a quick question for you. How did you package your pretrained weights into the spaCy language model en_core_med7_lg? I've pretrained on a small corpus for about 50 epochs and have a model50.bin file now, but apart from using it in spacy train -t2v model50.bin, I'm not sure how to include it in a package.
You should only need to provide it in the spacy train command. After you save out your model from spacy train, it will use the current value of the weights, which started from the initial model but have since changed. It's sort of like starting from a different save game: later you'll make more progress and create a new save point, and you don't need to carry around the first save point with you to resume.
Echoing Matt, I used spacy pretrain as the first step before training my model.
A simple workflow:
python -m spacy pretrain ./data/data_for_pretrain.jsonl en_vectors_web_lg ./token2vectors_weights
and then you pass the pretrained weights to spacy train:
python -m spacy train en ./nlp_model ./data/train.json ./data/test.json -t2v ./token2vectors_weights/model50.bin
When you package your model, just follow the spacy package documentation (https://spacy.io/api/cli#package).
I hope this helps.
Thanks for getting back to me. I was assuming the pretrained weights had to be packaged the same way the word vectors are. I'll follow this and hopefully I'll get results as good as you had with Med7.
Andrey! This is a great project. I've been using it for my purposes, but had a question.
Did you or the team ever try to match the other Med7 entity types to each DRUG entity? It seems like DOSAGE, STRENGTH, FREQUENCY, etc. are related to the mention of a specific DRUG, and are logically "downstream" of that DRUG entity.
What I wanted to do was create a custom pipeline component that, for each DRUG entity, sets an extension attribute containing all the other entities that relate to it, so they can be accessed like this:
sent = med7('She was prescribed Ibuprofen 200 mg daily for two weeks.')
for ent in sent.ents:
    if ent.label_ == 'DRUG':
        print(ent._.drug_attributes)
>>> (('200 mg', 'STRENGTH'), ('daily', 'FREQUENCY'), ('for two weeks', 'DURATION'))
But the logic is hard. I've tried dependency parsing but there is every conceivable dependency one could imagine, so the rule-based approach is tough.
I figured you might have some experience or may have even pursued this functionality. I've had a lot of trouble because there are so many possible linguistic relationships one could envision between drug_attributes and each DRUG.
@honnibal What would be the best approach? Do you think associating these Med7 entities I'm calling "drug_attributes" (DOSAGE, STRENGTH, etc.) to a specific DRUG in each sentence is a task well-suited to a rule-set or matching? Or do you think this would probably be better for a statistical model?
By coincidence, I am also working on this problem of linking entities with a relationship. spaCy does not include this capability. This task is called relation classification or relation extraction. NLP Progress has a good summary of the current SOTA literature here. I'd recommend the Matching the Blanks paper and particularly this implementation of it (which uses spaCy!) and this blog post by the same person.
Ultimately, if you got as far as gathering data and training a model that can classify the relationship between your entities, you'd need to write a custom spaCy component that fills in the drug_attributes of the DRUG entity. It's not an easy task though.
If you are willing to go to those lengths, I'd recommend using the rel.manual recipe in Prodigy to annotate the relationships between DOSAGES, and so on (or whatever your entities are).
I went ahead and made a simple, mechanical implementation! It doesn't rely on the dependency tags at all, but on the natural style of clinical records. Observe:
example = 'Patient was prescribed Ibuprofen 200 mg daily for two weeks, Dilaudid 2mg daily, morphine intravenously.'
The pattern is a DRUG and its ATTRIBUTES, then the next DRUG and its ATTRIBUTES, and so on:
first set = (Ibuprofen, DRUG), (200 mg, STRENGTH), (daily, FREQUENCY), (for two weeks, DURATION)
second set = (Dilaudid, DRUG), (2mg, STRENGTH), (daily, FREQUENCY)
third set = (morphine, DRUG), (intravenously, ROUTE)
By creating a pointer that moves through the entities in order and assigns each "attribute" entity to the drug the pointer currently rests on, I leveraged the rote patterns inherent in the style clinical records are written in (clinicians are not literary; their notes are typically journalistic and list-like).
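For anyone who wants to try the pointer idea, here is a minimal sketch of how it might look. It is not the exact code I run: it assumes the entities arrive as (text, label) tuples in document order, and the name group_by_drug is purely illustrative.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (text, label), in the order they appear in the note


def group_by_drug(entities: List[Entity]) -> List[Tuple[str, List[Entity]]]:
    """Attach each non-DRUG entity to the most recent DRUG seen before it.

    Hypothetical helper for illustration; assumes attributes follow their drug.
    """
    groups: List[Tuple[str, List[Entity]]] = []
    for text, label in entities:
        if label == "DRUG":
            # A new DRUG moves the pointer: later attributes belong to it.
            groups.append((text, []))
        elif groups:
            # Attribute entity: assign it to the drug the pointer rests on.
            groups[-1][1].append((text, label))
        # Attributes seen before any DRUG have nothing to attach to; skip them.
    return groups


entities = [
    ("Ibuprofen", "DRUG"), ("200 mg", "STRENGTH"), ("daily", "FREQUENCY"),
    ("for two weeks", "DURATION"),
    ("Dilaudid", "DRUG"), ("2mg", "STRENGTH"), ("daily", "FREQUENCY"),
    ("morphine", "DRUG"), ("intravenously", "ROUTE"),
]

for drug, attributes in group_by_drug(entities):
    print(drug, attributes)
```

With a spaCy doc, the input list could be built as [(ent.text, ent.label_) for ent in doc.ents], since doc.ents is already in document order. Note that this sketch would misattribute an attribute written before its drug (for example "200 mg of Ibuprofen"), so it only fits notes that follow the drug-then-attributes pattern.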
This has gotten us very high accuracy (I reckon 98%+, comparing our attributions to Amazon Comprehend Medical) on several hundred records with several-to-many drugs per record. 89% accuracy comes for free, because most sentences that have any DRUG label have only one, so attribution is simple.
Let me know if you need any help! And I'd love an update as to your relationship linking as it comes along.