One thing you could do is add a second pipeline component after the entity recognizer that looks for DRUG entities and then sets a custom attribute on those entities specifying a list of subtypes (e.g. ent._.entity_subtypes).
Here’s an example of a rule-based solution – but you could obviously also swap out the dictionary for a statistical solution, or a combination.
import spacy
from spacy.tokens import Span

# dictionary of lowercase entity texts mapped to subtypes
DRUG_SUBTYPES = {
    'citalopram': ['ANTIDEPRESSANT', 'SOMETHING_ELSE'],
    'lexapro': ['ANTIDEPRESSANT'],
    # etc.
}

# register global span._.entity_subtypes extension
Span.set_extension('entity_subtypes', default=None)

def assign_subtypes(doc):
    # this function will be added after the NER in the pipeline
    for ent in doc.ents:
        if ent.label_ == 'DRUG':
            # look up the lowercased entity text (the dictionary keys are
            # lowercase) and set the custom attribute; stays None if not found
            ent._.entity_subtypes = DRUG_SUBTYPES.get(ent.text.lower())
    return doc
You could then use the component like this:
nlp = spacy.load('/path/to/your/drugs/model')
nlp.add_pipe(assign_subtypes, after='ner')
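If you want to check the lookup logic without loading your trained model, here's a minimal sketch that does it on a blank English pipeline. The example sentence and the hand-set DRUG span are assumptions for illustration only (in your real pipeline the entity would come from the NER), and the component is called directly on the doc rather than added to the pipeline.

```python
import spacy
from spacy.tokens import Span

# same lowercase lookup table as above
DRUG_SUBTYPES = {
    "citalopram": ["ANTIDEPRESSANT", "SOMETHING_ELSE"],
    "lexapro": ["ANTIDEPRESSANT"],
}

# force=True so re-running the snippet doesn't raise on re-registration
Span.set_extension("entity_subtypes", default=None, force=True)


def assign_subtypes(doc):
    for ent in doc.ents:
        if ent.label_ == "DRUG":
            ent._.entity_subtypes = DRUG_SUBTYPES.get(ent.text.lower())
    return doc


# blank pipeline: tokenizer only, no trained NER (assumption for testing)
nlp = spacy.blank("en")
doc = nlp("She was prescribed Citalopram last year.")

# hypothetical: set the DRUG entity by hand (token 3 is "Citalopram")
doc.ents = [Span(doc, 3, 4, label="DRUG")]

# call the component directly, as the pipeline would after the NER
doc = assign_subtypes(doc)

subtypes = {ent.text: ent._.entity_subtypes for ent in doc.ents}
print(subtypes)
```

The subtypes are then available on every DRUG entity via ent._.entity_subtypes, and entities that aren't in the dictionary keep the default of None. Also note that on spaCy v3, components are registered with the @Language.component decorator and added by their string name, instead of passing the function to nlp.add_pipe.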