I've trained the NER pipeline in my spaCy workflow to identify a new entity label, DRUG, and it does so very well. The next task is to identify logically "downstream" related entities, like DOSAGE. What I mean is that, for my purposes, a DOSAGE entity requires a DRUG entity (though not all DRUG entities have a DOSAGE entity).
Examples:
"I prescribed the patient Lipitor (ent: DRUG) 20 mg (sub_ent: DOSAGE) daily."
"Patient is allergic to Acetaminophen (ent: DRUG)."
"Systolic blood pressure of 132 mmHg (ent: TEST)"
What's the best way to structure the pipeline to accomplish this task? My intuition is to create a new pipeline component and use the rule-based matcher to find nearby numeric values and measurements for drugs identified in the NER pipe upstream.
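A minimal sketch of that idea, assuming spaCy v3. The component name `dosage_linker`, the `MAX_GAP` window, and the unit list are hypothetical choices for illustration, and an `entity_ruler` stands in for my trained NER component so the snippet runs on its own:

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Hypothetical window: a dosage must start within this many tokens after a DRUG.
MAX_GAP = 3


class DosageLinker:
    """Labels number + unit matches as DOSAGE only when they follow a DRUG entity."""

    def __init__(self, vocab):
        self.matcher = Matcher(vocab)
        # Illustrative pattern: a number-like token followed by a unit token.
        pattern = [
            {"LIKE_NUM": True},
            {"LOWER": {"IN": ["mg", "mcg", "g", "ml"]}},
        ]
        self.matcher.add("DOSAGE", [pattern])

    def __call__(self, doc):
        drugs = [ent for ent in doc.ents if ent.label_ == "DRUG"]
        new_ents = list(doc.ents)
        for _match_id, start, end in self.matcher(doc):
            # Keep the match only if it starts shortly after some DRUG entity ends.
            if any(0 <= start - drug.end <= MAX_GAP for drug in drugs):
                new_ents.append(Span(doc, start, end, label="DOSAGE"))
        # doc.ents cannot hold overlapping spans; filter_spans keeps the longest.
        doc.ents = filter_spans(new_ents)
        return doc


@Language.factory("dosage_linker")
def create_dosage_linker(nlp, name):
    return DosageLinker(nlp.vocab)


# Stand-in for the trained pipeline: a blank model plus a ruler that tags DRUG.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "DRUG", "pattern": "Lipitor"},
    {"label": "DRUG", "pattern": "Acetaminophen"},
])
nlp.add_pipe("dosage_linker", last=True)

doc = nlp("I prescribed the patient Lipitor 20 mg daily.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In the real pipeline the component would be added after the trained `ner` pipe rather than after an `entity_ruler`. Because the matcher only fires near a DRUG, a sentence with a measurement but no drug gets no DOSAGE entity.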
Alternatively, I could use a flat NER scheme that treats DOSAGE as a first-class entity like DRUG (but this seems messier and more prone to false positives).
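For comparison, here is roughly what training data for the flat scheme might look like, assuming spaCy v3's `DocBin` format. The `flat_annotations` list and variable names are made up for illustration; both labels simply sit side by side in `doc.ents`, with nothing tying the DOSAGE to its DRUG:

```python
import spacy
from spacy.tokens import DocBin

nlp_flat = spacy.blank("en")

text = "I prescribed the patient Lipitor 20 mg daily."
# Hypothetical flat annotations: DOSAGE is just another label next to DRUG.
flat_annotations = [("Lipitor", "DRUG"), ("20 mg", "DOSAGE")]

train_doc = nlp_flat.make_doc(text)
spans = []
for phrase, label in flat_annotations:
    start = text.index(phrase)
    span = train_doc.char_span(start, start + len(phrase), label=label)
    # char_span returns None when the offsets do not align with token boundaries.
    if span is not None:
        spans.append(span)
train_doc.ents = spans

# Collect annotated docs into a DocBin, the container spaCy v3 training reads.
doc_bin = DocBin(docs=[train_doc])
print([(ent.text, ent.label_) for ent in train_doc.ents])
```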
Another idea is to have a later pipeline component that, instead of using the rule-based matcher, is itself an NER component from a base spaCy language model and that likewise only operates on the entities labeled DRUG by the upstream NER component.
Thanks for the help!