Fuzzy (partial) matching with PhraseMatcher (NER task)

I have a large ontology of terms (medical, ~60k items) which I’m using with PhraseMatcher to get text spans (and extract entities) in order to prepare training data for an NER task. The main problem is that some terms appear in the text in a different form than in the ontology. My question is how to get PhraseMatcher to pick up those terms in texts and extract them.

An important assumption: the form in which a term appears in the ontology is the most complete and longest one. In the text, the term may appear either exactly as in the ontology or in a shorter (abbreviated) form.

For example:

import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank('en')

drug_list = ['Adenuric', 'Adepend', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin HCl', 'Co-Magaldrox(Magnesium/Aluminium Hydrox)']
matcher = PhraseMatcher(nlp_blank.vocab)
matcher.add('DRUG', None, *[nlp_blank(entity_i) for entity_i in drug_list])

doc = nlp_blank("A patient was prescribed Adepend 5mg, Alfuzosin 20ml and co-magaldrox 5 mg")
matches = matcher(doc)

for m_id, start, end in matches:
    entity = doc[start:end]
    print((entity.text, entity.start_char, entity.end_char, nlp_blank.vocab.strings[m_id]))

The result (now):

('Adepend', 25, 32, 'DRUG')

What I would like to have is:

('Adepend', 25, 32, 'DRUG')
('Alfuzosin', 38, 47, 'DRUG')
('Co-Magaldrox', 57, 69, 'DRUG')

At the moment, I’m trying the fuzzywuzzy Python package for partial matching:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

text = "A patient was prescribed Adepend 5mg, Alfuzosin 20ml and co-magaldrox 5 mg"

for query in drug_list:
    print(process.extractOne(query, text.split()))

The output:

('A', 90)
('Adepend', 100)
('A', 60)
('A', 90)
('Alfuzosin', 100)
('co-magaldrox', 90)

which also brings in irrelevant cases (‘A’), but deals nicely with fuzzy matching.

Any ideas how to solve the partial (fuzzy) matching elegantly with spaCy?


spaCy's matcher doesn't do any subword matching, so your fuzzywuzzy solution seems like a pretty good way to do this.

If speed is a concern, you could also consider adding more patterns to the PhraseMatcher with the different forms: e.g. "Adgyn Combi", "Adgyn" etc. Even if you end up with 200k patterns that way, the phrase matcher should still run just as fast, because it's mostly sensitive to the set of word types being searched for. Adding different combinations of those words doesn't really change the performance.

You should be able to generate those additional patterns programmatically by splitting / tokenizing the strings. You could even use Prodigy to click through the terms and kick out the ones that you know are invalid or too ambiguous.
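As a rough sketch of the "generate patterns by splitting the strings" idea, one way is to take word-level prefixes of each multi-word term (this assumes the abbreviated forms are always prefixes of the full term, per the assumption above; the junk it produces would still need filtering):

```python
def prefix_patterns(terms):
    """Generate all word-level prefixes of each term, e.g.
    'Adgyn Combi' -> {'Adgyn Combi', 'Adgyn'}."""
    patterns = set()
    for term in terms:
        words = term.split()
        for i in range(1, len(words) + 1):
            patterns.add(" ".join(words[:i]))
    return patterns

drug_list = ['Adenuric', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin HCl']
print(sorted(prefix_patterns(drug_list)))
# e.g. ['Adenuric', 'Adgyn', 'Adgyn Combi', 'Adgyn XL', 'Alfuzosin', 'Alfuzosin HCl']
```

The expanded set can then be fed to the PhraseMatcher exactly like the original list.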

If you want to do this more elegantly, you could wrap the logic in a custom pipeline component that is initialized with the list of strings, takes care of filtering duplicates and – optionally? – removes overlapping matches and adds the results to the doc.ents. This could be a really nice spaCy plugin, actually :smiley:

On a related note, this might be relevant to you: The current spacy-nightly already supports changing the attribute the PhraseMatcher matches on. So instead of matching on ORTH (token.text), you could use LOWER for case-insensitive matches. Details here:
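For illustration, matching on LOWER looks roughly like this (a sketch using the current v3-style matcher.add signature rather than the older add(key, None, *patterns) form used above):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" makes the matcher compare lowercased token text
# instead of the verbatim ORTH attribute.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(term) for term in ["Adepend", "Co-Magaldrox"]]
matcher.add("DRUG", patterns)

doc = nlp("A patient was prescribed adepend and CO-MAGALDROX")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)  # both terms match despite the different casing
```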


Hi Ines,

Many thanks for your great suggestions (as usual). I’ve generated (programmatically) a pretty large collection of possible tokens from my list, but the problem is (as you mentioned) that some of them are ‘junk’ (bits of words). For example: ‘blood thinner’ is a drug, while ‘blood’ and ‘thinner’ are not. I’m using Prodigy to refine the suggestions, but I can’t create a very large training set, as I still need to click (accept or reject) each one.

Anyway, I will try the steps you suggested and let’s see what comes out.

Cool, definitely interested in hearing how you go!

Maybe you can come up with a clever way of sorting the terms and filtering out the terms that are most likely problematic? This might get you down to a few thousand terms, which you can more easily refine using a simple recipe like mark where you just click yes or no.

For example, it’s fair to assume that most single-letter tokens are likely not useful, right? You could also check the terms against a standard English dictionary – this would also flag false positives like “blood thinner”, but likely also a lot of other stuff that’s not a drug on its own.

You could also check the terms against a standard English dictionary

That might be very helpful! How do I compare my tokens to a standard English dictionary?

An easy way to start could be to use the en_vectors_web_lg model, which comes with a large vocabulary. The vocab is based on the training data, so it should have good coverage of general-purpose language. You especially want to check token.is_oov (whether the token is out-of-vocabulary) and token.prob, the single-word probability, which tells you whether the word is common or not. I guess if the word is very common, this could be an indicator that it’s not a term you’re looking for.

There are also various dictionary APIs you could use. Not sure what the terms and rate limits are, but you might find a developer-friendly one that lets you make a bunch of requests.
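The filtering idea can be sketched without the vectors model, using any plain English word list as a stand-in for the model vocabulary (the tiny dictionary and the keep_term helper here are purely illustrative):

```python
# Stand-in for a real English dictionary or model vocabulary.
common_english = {"blood", "thinner", "a", "the", "patient"}

candidate_terms = ["blood thinner", "blood", "thinner", "A", "Adepend"]

def keep_term(term, dictionary):
    """Drop single letters and single common English words;
    keep everything else (multi-word terms go to manual review)."""
    words = term.lower().split()
    if len(words) == 1:
        word = words[0]
        return len(word) > 1 and word not in dictionary
    return True

kept = [t for t in candidate_terms if keep_term(t, common_english)]
print(kept)  # e.g. ['blood thinner', 'Adepend']
```

With en_vectors_web_lg, the dictionary lookup would instead be a check of token.is_oov / token.prob.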

Amazing, many thanks for your time and suggestion! Highly appreciated.



I also wanted the fuzzy phrase matcher option and I put my solution on github. I am sure there are other ways to achieve this, but it works.



@jackmen Nice, thanks for sharing! :smiley: (If you want to wrap this as a Python package and submit it to the spaCy Universe, that'd be cool, too! Fuzzy matching is something that comes up occasionally and we currently don't have a built-in solution. So a plugin like this is really nice to have.)

Dear Ines,

yes, that would be cool. I will fine tune this a little bit and submit it to spacy universe once ready.

Thanks a lot for all the nice tools you guys provide!!



Looks nice @jackmen.

A better solution would be to get the matched phrase back from fuzzywuzzy itself instead of building your own custom logic around it. Please vote for the issue :).
