Med7 — an information extraction model for clinical natural language processing (built with spaCy & Prodigy)

Andrey! This is a great project. I've been using it for my own work and have a question.

Did you or the team ever try to link the other Med7 entity types to the 'DRUG' entity they describe? It seems like DOSAGE, STRENGTH, FREQUENCY, etc. are related to the mention of a specific DRUG, and are logically "downstream" of that DRUG entity.

What I wanted to do was create a custom pipeline component that, for every DRUG entity, sets an extension attribute holding all the other entities related to that drug, so they could be accessed like this:

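# assumes med7 is a loaded Med7 pipeline, e.g. med7 = spacy.load('en_core_med7_lg')
sent = med7('She was prescribed Ibuprofen 200 mg daily for two weeks.')

for ent in sent.ents:
    if ent.label_ == 'DRUG':
        # drug_attributes is the custom extension I'd like to have; Med7 does not provide it
        print(ent._.drug_attributes)
>>> (('200 mg', 'STRENGTH'), ('daily', 'FREQUENCY'), ('for two weeks', 'DURATION'))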

But the linking logic is hard. I've tried using the dependency parse, but a DRUG and its attributes can be connected by just about every dependency path imaginable, so a rule-based approach is tough.
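
For comparison, the simplest baseline ignores syntax entirely and attaches each attribute to the nearest DRUG in the same sentence. The following is only a minimal sketch of that idea, not anything Med7 or spaCy provides: `Ent`, `token_gap`, and `link_attributes` are hypothetical names, and the token offsets are written out by hand rather than taken from a real parse.

```python
from dataclasses import dataclass


@dataclass
class Ent:
    """Hypothetical stand-in for an entity span (token offsets, end exclusive)."""
    text: str
    label: str
    start: int
    end: int
    sent_id: int = 0


def token_gap(a: Ent, b: Ent) -> int:
    """Tokens separating two spans; 0 when they touch or overlap."""
    return max(a.start - b.end, b.start - a.end, 0)


def link_attributes(ents):
    """Pair each DRUG with the non-DRUG entities that sit closest to it."""
    drugs = [e for e in ents if e.label == "DRUG"]
    linked = [(drug, []) for drug in drugs]
    for ent in ents:
        if ent.label == "DRUG":
            continue
        # Only consider drugs in the same sentence as the attribute.
        candidates = [
            (token_gap(ent, drug), i)
            for i, drug in enumerate(drugs)
            if drug.sent_id == ent.sent_id
        ]
        if candidates:
            # On a tie, min() picks the drug that appears first in the list.
            _, best = min(candidates)
            linked[best][1].append((ent.text, ent.label))
    return linked


# Hand-written token offsets for:
# She was prescribed Ibuprofen 200 mg daily for two weeks.
ents = [
    Ent("Ibuprofen", "DRUG", 3, 4),
    Ent("200 mg", "STRENGTH", 4, 6),
    Ent("daily", "FREQUENCY", 6, 7),
    Ent("for two weeks", "DURATION", 7, 10),
]
linked = link_attributes(ents)
for drug, attrs in linked:
    print(drug.text, attrs)
```

In a real pipeline the same lists could presumably be stored on a custom attribute registered with `Span.set_extension`. The heuristic handles the single-drug sentence above, but it is easy to break: in something like "ibuprofen and aspirin, 200 mg each" the strength would be attached to only one drug, which is exactly the kind of case where distance alone is not enough.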

I figured you might have some experience with this or may have even pursued this functionality. I've had a lot of trouble because there are so many possible linguistic relationships one could envision between drug_attributes and each DRUG.

@honnibal What would be the best approach? Do you think associating these Med7 entities I'm calling "drug_attributes" (DOSAGE, STRENGTH, etc.) with a specific DRUG in each sentence is a task well suited to a rule set or matcher? Or would it be better handled by a statistical model?