I have a large ontology of terms (medical, ~60k items) that I'm using with PhraseMatcher to get text spans (and extract entities) in order to prepare training data for an NER task. The main problem is that some terms do not appear in the texts in the same form as in the ontology. My question is how to make PhraseMatcher pick up those terms in the texts and extract them.
An important assumption: the form in which a term appears in the ontology is the longest and most complete one. In the text, the term may appear either exactly as in the ontology or in a shorter (abbreviated) form.
For example:
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank('en')
drug_list = ['Adenuric', 'Adepend', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin HCl',
             'Co-Magaldrox(Magnesium/Aluminium Hydrox)']

matcher = PhraseMatcher(nlp_blank.vocab)
# spaCy v2 call signature; in v3 this would be matcher.add('DRUG', patterns)
matcher.add('DRUG', None, *[nlp_blank(entity_i) for entity_i in drug_list])

doc = nlp_blank("A patient was prescribed Adepend 5mg, Alfuzosin 20ml and co-magaldrox 5 mg")
matches = matcher(doc)
for m_id, start, end in matches:
    entity = doc[start : end]
    print((entity.text, entity.start_char, entity.end_char, nlp_blank.vocab.strings[m_id]))
The result (now):
('Adepend', 25, 32, 'DRUG')
What I would like to have is:
('Adepend', 25, 32, 'DRUG')
('Alfuzosin', 38, 47, 'DRUG')
('Co-Magaldrox', 57, 69, 'DRUG')
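To make the "ontology form or shorter" assumption concrete, here is a minimal sketch of one way it might be turned into extra patterns: every ontology term is expanded into its leading parts that end on a word boundary. The helper name `prefix_variants` and the `WORD` tokenisation regex are hypothetical choices for illustration, not part of spaCy, and the sketch uses only the standard library.

```python
import re

# A "word" here is a run of letters/digits, optionally joined by hyphens
# (so "Co-Magaldrox" stays in one piece). This tokenisation is an assumption.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def prefix_variants(term):
    """Return the term plus every leading part that ends on a word, longest first."""
    ends = [m.end() for m in WORD.finditer(term)]
    variants = [term[:end] for end in reversed(ends)]
    if term not in variants:  # e.g. the term ends in ")" rather than in a word
        variants.insert(0, term)
    return variants


drug_list = ['Adenuric', 'Adepend', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin HCl',
             'Co-Magaldrox(Magnesium/Aluminium Hydrox)']

# Deduplicate: "Adgyn" is produced by both "Adgyn Combi" and "Adgyn XL".
patterns = sorted({v for term in drug_list for v in prefix_variants(term)})
print(patterns)
```

Each string in `patterns` could then be passed through `nlp_blank.make_doc` and added to a `PhraseMatcher` created with `attr='LOWER'` so that case differences such as "co-magaldrox" are ignored. Ambiguous prefixes (for example "Adgyn") and overlapping matches would still have to be resolved separately.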
At the moment, I'm trying the fuzzywuzzy Python package for partial matching:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
text = "A patient was prescribed Adepend 5mg, Alfuzosin 20ml and co-magaldrox 5 mg"
for query in drug_list:
    print(process.extractOne(query, text.split()))
The output:
('A', 90)
('Adepend', 100)
('A', 60)
('A', 90)
('Alfuzosin', 100)
('co-magaldrox', 90)
This also brings in irrelevant matches ('A'), but otherwise handles the fuzzy matching nicely.
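A rough sketch of how the spurious 'A' hits might be suppressed: ignore very short tokens and score each remaining token only against the leading part of the ontology term of the same length, which follows from the assumption that the text form is the ontology form or shorter. Note that this swaps fuzzywuzzy for the standard library's `difflib.SequenceMatcher` so that the example is self-contained. The function name `best_token` and the thresholds `min_len` and `cutoff` are hypothetical and would need tuning.

```python
from difflib import SequenceMatcher


def best_token(query, text, min_len=3, cutoff=0.8):
    """Return (token, score) for the best-matching token, or None if none passes cutoff.

    Each token is compared with the start of the query truncated to the
    token's length, so a shortened form of a long term can still score 1.0.
    """
    query = query.lower()
    best = None
    for token in text.split():
        if len(token) < min_len:  # drops noise such as 'A' or '5'
            continue
        head = query[:len(token)]
        score = SequenceMatcher(None, token.lower(), head).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (token, score)
    return best


text = "A patient was prescribed Adepend 5mg, Alfuzosin 20ml and co-magaldrox 5 mg"
drug_list = ['Adenuric', 'Adepend', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin HCl',
             'Co-Magaldrox(Magnesium/Aluminium Hydrox)']

for query in drug_list:
    print(query, '->', best_token(query, text))
```

This works on whitespace-split tokens only, so it does not return character offsets and would miss shortened multi-word terms; it is meant to illustrate the filtering idea rather than replace the matcher.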
Any ideas how to solve the partial (fuzzy) matching elegantly with spaCy?