If you work on text data with many spelling errors or unconventional spelling (e.g., social media), you might benefit from fuzzy string matching.
The spaczz library for fuzzy (and regex) pattern matching was built as an alternative to spaCy's Matcher, which makes it very easy to integrate into Prodigy's pattern-based recipes.
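To give a feel for what fuzzy matching buys you over exact matching, here is a minimal sketch that uses only Python's standard library. Note that it is not how spaczz works internally: `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure, and the `fuzzy_find` helper, the `min_ratio` threshold and the example sentence are all made up for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, phrases, min_ratio=0.8):
    """Return (start_token, end_token, phrase, ratio) for fuzzy phrase hits.

    Hypothetical helper: whitespace tokenization and difflib similarity,
    not the spaczz algorithm.
    """
    tokens = text.lower().split()
    hits = []
    for phrase in phrases:
        width = len(phrase.split())
        # Slide a window with as many tokens as the phrase has.
        for start in range(len(tokens) - width + 1):
            window = " ".join(tokens[start:start + width])
            ratio = SequenceMatcher(None, window, phrase.lower()).ratio()
            if ratio >= min_ratio:
                hits.append((start, start + width, phrase, round(ratio, 2)))
    return hits


# "kikflip" is a misspelling that an exact phrase match would miss.
hits = fuzzy_find(
    "Finally landed my first kikflip and a tre flip today",
    ["kickflip", "tre flip"],
)
for hit in hits:
    print(hit)
```

The threshold is the main knob here: lower it and more misspellings get through, but so do more false positives.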
I have created an example NER manual recipe that uses FuzzyMatcher instead of spaCy's PhraseMatcher to power span pre-highlighting. I have also uploaded example patterns covering the names of skateboarding tricks and an example corpus extracted from the skateboarding subreddit, which contains some misspelled trick names.
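If you want to build a similar patterns file for your own labels, a sketch like the following might help. It assumes the string-based form of Prodigy's JSONL patterns (one `{"label": ..., "pattern": ...}` object per line). The trick names are hypothetical placeholders, so the uploaded file may well contain different entries.

```python
import json

# Hypothetical trick names, used only for illustration.
tricks = ["kickflip", "heelflip", "tre flip"]

# One JSON object per line, each pairing the label with a phrase to match.
lines = [json.dumps({"label": "TRICK", "pattern": trick}) for trick in tricks]
patterns_jsonl = "\n".join(lines)
print(patterns_jsonl)
```

You would save the printed lines to a `.jsonl` file and pass that file as the patterns argument of the recipe.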
To try this out, run the following command from the prodigy-recipes repo:
prodigy ner.fuzzy.manual tricks blank:en example-datasets/skate_reddit.jsonl example-patterns/patterns_skateboarding_tricks-TRICK.jsonl --label TRICK -F ner/ner_fuzzy_manual.py
This will bring up the Prodigy NER manual UI with the names of the tricks pre-highlighted, including the "misspelled" ones!
Please note that the FuzzyMatcher used in this recipe does not resolve multiple overlapping matches. To support overlapping matches please check
We highly recommend checking the spaczz documentation for more options to optimize your pattern matching further.
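On the overlapping matches mentioned above: one possible way to resolve them is to prefer the longest span and drop anything that overlaps it, similar in spirit to spaCy's `filter_spans` utility. The sketch below is only an illustration of that idea and is not part of the recipe or of spaczz. The `filter_overlaps` helper and the `(start, end, label)` tuples are assumptions made for the example.

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span overlapping a kept one.

    Hypothetical helper: spans are (start_token, end_token, label) tuples
    with an exclusive end offset.
    """
    # Sort by length (longest first), then by start offset (earliest first).
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, taken = [], set()
    for start, end, label in ordered:
        if all(i not in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(kept)


# Two candidate matches cover token 3; only the longer one survives.
candidates = [(3, 4, "TRICK"), (3, 5, "TRICK"), (7, 8, "TRICK")]
resolved = filter_overlaps(candidates)
print(resolved)
```

Preferring the longest span is just one policy. With fuzzy matches you might instead want to keep the span with the highest similarity score.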
- spaczz documentation: GitHub - gandersen101/spaczz: Fuzzy matching and more functionality for spaCy.
- ner.fuzzy.manual example recipe: prodigy-recipes/ner_fuzzy_manual.py at master · explosion/prodigy-recipes · GitHub
- example patterns with skateboarding tricks: prodigy-recipes/patterns_skateboarding_tricks-TRICK.jsonl at master · explosion/prodigy-recipes · GitHub
- skateboarding subreddit corpus: prodigy-recipes/skate_reddit.jsonl at master · explosion/prodigy-recipes · GitHub