Fuzzy string pattern matching in NER recipe with spaczz

Hi!
If you work on text data with many spelling errors or unconventional spelling (e.g., social media), you might benefit from fuzzy string matching.
The spaczz library for fuzzy (and regex) pattern matching was built as an alternative to spaCy's Matcher, which makes it very easy to integrate into Prodigy's pattern-based recipes.
I have created an example NER manual recipe that uses the spaczz FuzzyMatcher instead of the spaCy PhraseMatcher to power span pre-highlighting. I have also uploaded example patterns covering the names of skateboarding tricks, along with an example corpus extracted from the skateboarding subreddit, which contains some misspelled trick names.
To try this out, run the following command from the prodigy-recipes repo:

prodigy ner.fuzzy.manual tricks blank:en example-datasets/skate_reddit.jsonl example-patterns/patterns_skateboarding_tricks-TRICK.jsonl --label TRICK -F ner/ner_fuzzy_manual.py

This will bring up the Prodigy NER manual UI with the trick names pre-highlighted, including the "misspelled" ones!
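
To give a feel for what fuzzy pre-highlighting amounts to, here is a minimal standard-library sketch. It is not the recipe's code and it does not use spaczz: `difflib.SequenceMatcher` stands in for the fuzzy comparison, and the pattern list, the `fuzzy_spans` helper, the example sentence, and the 0.8 threshold are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical patterns, shaped like the entries of a patterns JSONL file
# (one {"label": ..., "pattern": ...} object per line).
PATTERNS = [
    {"label": "TRICK", "pattern": "kickflip"},
    {"label": "TRICK", "pattern": "frontside boardslide"},
]


def fuzzy_spans(text, patterns, min_ratio=0.8):
    """Return character spans whose text is similar enough to a pattern.

    Tokenization is plain whitespace splitting. Each pattern is compared
    against every window that has as many tokens as the pattern itself.
    """
    # Record the character offsets of every whitespace-separated token.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((start, end))
        pos = end

    spans = []
    for entry in patterns:
        target = entry["pattern"].lower()
        n = len(target.split())
        for i in range(len(tokens) - n + 1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[start:end].lower()
            ratio = SequenceMatcher(None, target, candidate).ratio()
            if ratio >= min_ratio:
                spans.append(
                    {
                        "start": start,
                        "end": end,
                        "label": entry["label"],
                        "text": text[start:end],
                    }
                )
    return spans


example = "Finally landed a kikflip and a frontside bordslide today"
spans = fuzzy_spans(example, PATTERNS)
for span in spans:
    print(span)
```

An exact matcher would miss both misspellings in the example sentence, while the similarity threshold lets them through. In practice, the threshold controls the trade-off between catching typos and highlighting unrelated words.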

Please note that the spaczz FuzzyMatcher used in this recipe does not resolve overlapping matches. If you need overlapping matches handled, take a look at the spaczz SpaczzRuler.
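
For context, one common way to resolve overlaps is to keep the longest match and drop anything that overlaps it. The sketch below shows that greedy strategy on plain `(start, end, label)` tuples. It illustrates the general idea only and is not a description of what SpaczzRuler does internally; `resolve_overlaps` is a hypothetical helper name and the match list is made up.

```python
def resolve_overlaps(matches):
    """Greedily keep the longest matches, dropping any that overlap a kept one.

    `matches` is a list of (start, end, label) token spans, `end` exclusive.
    """
    # Longest first; ties go to the match that starts earliest.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for start, end, label in ordered:
        if any(i in taken for i in range(start, end)):
            continue  # overlaps something we already kept
        kept.append((start, end, label))
        taken.update(range(start, end))
    return sorted(kept)


# Made-up matches: two pairs that overlap each other.
matches = [(3, 4, "TRICK"), (3, 5, "TRICK"), (7, 9, "TRICK"), (8, 9, "TRICK")]
resolved = resolve_overlaps(matches)
print(resolved)
```

If you are working with spaCy Span objects rather than tuples, `spacy.util.filter_spans` applies a similar longest-span-first rule.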

We highly recommend checking the spaczz documentation for more options to optimize your pattern matching further. 👍


Hi

Thanks for your post. When I run this code from the documentation, I get an error:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
ruler = SpaczzRuler(nlp)
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])

nlp.add_pipe(ruler)

doc = nlp('Oops, I spelled Bill Gatez wrong.')
print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])

What should I do?

Hi @ta13,
Thanks for trying this out!
The error you are getting comes from the call to add_pipe, whose signature changed in spaCy v3. It's true that the spaczz documentation has not been updated for spaCy v3 in that spot. As of spaCy v3, add_pipe requires the string name of the component being added rather than the component object.
For SpaczzRuler it would be:
spaczz_ruler = nlp.add_pipe("spaczz_ruler")
To rewrite your full example:

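import spacy
# Importing SpaczzRuler makes the "spaczz_ruler" factory available to spaCy.
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
# spaCy v3: add the component by its string name; add_pipe returns the component.
ruler = nlp.add_pipe("spaczz_ruler")
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])

doc = nlp('Oops, I spelled Bill Gatez wrong.')
print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])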

Hope that helps!
