Fuzzy string pattern matching in NER recipe with spaczz

magdaaniol · February 14, 2022, 12:15pm

Hi!
If you work with text data that contains many spelling errors or unconventional spellings (e.g., social media), you might benefit from fuzzy string matching.
The spaczz library for fuzzy (and regex) pattern matching has been built as an alternative to spaCy's Matcher, which makes it very easy to integrate into Prodigy's pattern-based recipes.
I have created an example NER manual recipe that uses the spaczz FuzzyMatcher instead of the spaCy PhraseMatcher to power span pre-highlighting. I have also uploaded example patterns covering the names of skateboarding tricks, and an example corpus extracted from the skateboarding subreddit, which contains some misspelled trick names.
To try this out, run the following command from the root of the prodigy-recipes repo:

prodigy ner.fuzzy.manual tricks blank:en example-datasets/skate_reddit.jsonl example-patterns/patterns_skateboarding_tricks-TRICK.jsonl --label TRICK -F ner/ner_fuzzy_manual.py

This will bring up the Prodigy NER manual UI with the names of the tricks pre-highlighted, including the "misspelled" ones!
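
To give a rough idea of what fuzzy pre-highlighting does, here is a minimal sketch using only the Python standard library. It is not the recipe's code and not spaczz's algorithm: difflib stands in for the fuzzy matcher, and the function name fuzzy_spans, the whitespace tokenizer, and the min_ratio threshold are all assumptions made for illustration. It slides a window over the tokens, compares each window to each pattern, and emits span dicts with the start/end/token_start/token_end/label keys that Prodigy uses for pre-highlighted spans.

```python
from difflib import SequenceMatcher


def fuzzy_spans(text, patterns, label, min_ratio=0.8):
    """Return Prodigy-style span dicts for token windows that roughly match a pattern.

    Hypothetical helper for illustration only; difflib replaces spaczz here.
    """
    # Naive whitespace tokenization that keeps character offsets.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    spans = []
    for pattern in patterns:
        n = len(pattern.split())  # window size = number of tokens in the pattern
        for i in range(len(tokens) - n + 1):
            start = tokens[i][1]
            end = tokens[i + n - 1][2]
            candidate = text[start:end]
            ratio = SequenceMatcher(None, candidate.lower(), pattern.lower()).ratio()
            if ratio >= min_ratio:
                spans.append({
                    "start": start,            # character offsets
                    "end": end,
                    "token_start": i,          # token indices, token_end inclusive
                    "token_end": i + n - 1,
                    "label": label,
                })
    return spans


example = {"text": "landed my first kikflip and a tre flipp today"}
example["spans"] = fuzzy_spans(example["text"], ["kickflip", "tre flip"], "TRICK")
for span in example["spans"]:
    print(example["text"][span["start"]:span["end"]], span)
```

Both "kikflip" and "tre flipp" clear the threshold here even though neither matches its pattern exactly. The real recipe delegates this step to the spaczz FuzzyMatcher, which offers more control over the similarity scoring.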

Please note that the spaczz FuzzyMatcher used in this recipe does not resolve multiple overlapping matches. To handle overlapping matches, please check out the spaczz SpaczzRuler.
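
If you want to see what resolving overlaps could look like, here is a small standard-library sketch. It is one possible greedy strategy, not a description of how SpaczzRuler works internally: the function name filter_overlaps, the (start, end, score) tuple layout with an exclusive end, and the tie-breaking order are assumptions made for illustration. Among overlapping matches it keeps the one with the highest score, preferring longer matches on ties.

```python
def filter_overlaps(matches):
    """Greedily keep non-overlapping matches, best score first.

    matches: list of (start, end, score) token offsets, end exclusive.
    Hypothetical helper for illustration only.
    """
    # Rank by score (descending), then length (descending), then position.
    ranked = sorted(matches, key=lambda m: (-m[2], -(m[1] - m[0]), m[0]))
    kept = []
    for start, end, score in ranked:
        # Accept a match only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


matches = [(3, 4, 93), (3, 5, 80), (6, 8, 94), (7, 8, 70)]
resolved = filter_overlaps(matches)
print(resolved)
```

In this example the two lower-scoring matches are dropped because each overlaps a higher-scoring one.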

We highly recommend checking the spaczz documentation for more options to further optimize your pattern matching.

Links:

spaczz documentation: GitHub - gandersen101/spaczz: Fuzzy matching and more functionality for spaCy.
ner.fuzzy.manual example recipe: prodigy-recipes/ner_fuzzy_manual.py at master · explosion/prodigy-recipes · GitHub
example patterns with skateboarding tricks: prodigy-recipes/patterns_skateboarding_tricks-TRICK.jsonl at master · explosion/prodigy-recipes · GitHub
skateboarding subreddit corpus: prodigy-recipes/skate_reddit.jsonl at master · explosion/prodigy-recipes · GitHub

ta13 · February 16, 2022, 1:43pm

Hi

Thanks for your post. When I run this code from the documentation, I get an error:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
ruler = SpaczzRuler(nlp)
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])

nlp.add_pipe(ruler)

doc = nlp('Oops, I spelled Bill Gatez wrong.')
print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])

What should I do?

magdaaniol · February 16, 2022, 6:32pm

Hi @ta13,
Thanks for trying this out!
The error you are getting comes from calling the add_pipe method with an outdated signature. It's true that the spaczz documentation has not been updated for spaCy v3 there. As of spaCy v3, add_pipe requires the string name of the component being added rather than the component object.
For SpaczzRuler it would be:
spaczz_ruler = nlp.add_pipe("spaczz_ruler")
Here is your full example rewritten:

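import spacy
# This import is still needed: it makes the "spaczz_ruler" component available to spaCy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
# spaCy v3: add the component by its string name; add_pipe returns the component
ruler = nlp.add_pipe("spaczz_ruler")
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])
doc = nlp('Oops, I spelled Bill Gatez wrong.')
print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])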

Hope that helps!
