Med7 — an information extraction model for clinical natural language processing (built with spaCy & Prodigy)

Very interesting new project by @Andrey, built on top of spaCy and using Prodigy for custom domain-specific annotation :sparkles:


Thanks for the post @ines. Thanks for the medium/arxiv articles and for making Med7 downloadable and pluggable within the spaCy Universe @Andrey.

We are about a year behind this project in building out models for suicide risk factor labeling of our clinical notes. But, we are following a similar path and have identified 7 custom entities to label in our privately held clinical notes.

In any event @Andrey, I would love to hear more details about the choices you made and the lessons you learned in your path to create Med7. The medium article is well done, so of course, it focuses on the big picture with few how-to details. I found the articles and github inspirational as I sometimes feel overwhelmed with so much to learn/do.

Speaking of needing help, I did post a note in the Consultants sticky note but it doesn't seem to get much traffic. Please let me know if you recall anyone you'd recommend for this kind of work.

Dear @sleclair0, apologies for my delayed reply. I am more than happy to share as many details as possible and help with your project. Actually, there should be (very soon) a new paper from our NLP group on using free-text electronic health records for suicide risk prediction. Please drop me an email (you can find it in our Med7 pre-print, at the bottom of the first page).


Hey Andrey, thanks for releasing your model and sharing your paper. I have a question regarding the pre-trained language model. Did you use the spacy pretrain command for this, or did you train it yourself using your own code?

Thanks,
Dan

Hi Dan,

I did use the spacy pretrain CLI on the entire MIMIC corpus and experimented with various parameters. You can find my pretrained models on GitHub: https://github.com/kormilitzin/med7

Let me know should you have any further questions!

I missed that at the start of your README, apologies for the obvious question! I'm working on a similar problem and Med7 has been really great inspiration.

Hi Andrey,

I've been experimenting with spacy pretrain and have a quick question for you. How did you package your pretrained weights into a spaCy language model like en_core_med7_lg? I've pretrained on a small corpus for about 50 epochs and now have a model50.bin file, but apart from passing it to spacy train via -t2v model50.bin, I'm not sure how to include it in a package.

Maybe this question would be better posed to @ines or @honnibal!

Thanks,
Dan

Hi @1danjordan,

You should only need to provide it to the spacy train command. The model you save out from spacy train will contain the current value of the weights, which started from the pretrained model but have since been updated. It's a bit like starting from a different save game: later you'll make more progress and create a new save point, and you don't need to carry the first save point around with you to resume.


Hi @1danjordan,

Echoing Matt, I used spacy pretrain as the first step before training my model.

A simple workflow:

python -m spacy pretrain ./data/data_for_pretrain.jsonl en_vectors_web_lg ./token2vectors_weights

and then you pass the pretrained weights to spacy train:

python -m spacy train en ./nlp_model ./data/train.json ./data/test.json -t2v ./token2vectors_weights/model50.bin
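For anyone following along: spacy pretrain (in spaCy v2) expects its input as a JSONL file with one JSON object per line, with the raw text under a "text" key (or pre-tokenized text under a "tokens" key). A minimal sketch of preparing such a file, with made-up clinical-style sentences standing in for your own corpus (the file name matches the command above):

```python
import json

# Hypothetical example texts; in practice these would come from your
# own corpus (e.g. MIMIC notes).
texts = [
    "Patient was prescribed 40 mg of atorvastatin daily.",
    "Continue metformin 500 mg twice a day for two weeks.",
]

# spacy pretrain (v2) reads newline-delimited JSON, one
# {"text": ...} object per line.
with open("data_for_pretrain.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")

# Sanity check: read the file back and confirm each line parses.
with open("data_for_pretrain.jsonl", encoding="utf8") as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```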

When you package your model, just follow the spacy package docs (https://spacy.io/api/cli#package).

I hope this helps.


Thanks Andrey,

Thanks for getting back to me. I was confusing packaging the pretrained weights with the way the word vectors are packaged. I'll follow this and hopefully I'll get results as good as yours with Med7 :crossed_fingers:

Cheers,
Dan
