Hi!
If you work on text data with many spelling errors or unconventional spelling (e.g., social media), you might benefit from fuzzy string matching.
The spaczz library for fuzzy (and regex) pattern matching was built as an alternative to spaCy's Matcher, which makes it very easy to integrate into Prodigy pattern-based recipes.
I have created an example NER manual recipe that uses the spaczz FuzzyMatcher instead of the spaCy PhraseMatcher to power span pre-highlighting. I have also uploaded example patterns covering the names of skateboarding tricks and an example corpus extracted from the skateboarding subreddit, which contains some misspelled trick names.
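To give a rough idea of what fuzzy pre-highlighting does, here is a minimal sketch that uses Python's standard-library difflib as a stand-in. It is not the spaczz algorithm or the code from the recipe. The function name fuzzy_spans, the min_ratio threshold of 0.8, the whitespace tokenization, and the example sentence are all assumptions made for illustration. It only shows the general idea: slide a window over the tokens, score each window against a pattern, and emit character-offset spans for the windows that are similar enough.

```python
import re
from difflib import SequenceMatcher


def fuzzy_spans(text, phrases, label, min_ratio=0.8):
    """Return character-offset spans for windows that fuzzily match a phrase.

    Illustrative only: whitespace tokens and difflib similarity stand in
    for a real tokenizer and a real fuzzy matcher.
    """
    # Character offsets of every whitespace-delimited token.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    spans = []
    for phrase in phrases:
        n = len(phrase.split())  # window size = number of words in the pattern
        for i in range(len(tokens) - n + 1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[start:end].lower()
            ratio = SequenceMatcher(None, candidate, phrase.lower()).ratio()
            if ratio >= min_ratio:
                spans.append({"start": start, "end": end, "label": label})
    return spans


# Hypothetical example sentence with one misspelled trick name.
example_text = "landed my first kikflip and a sloppy tre flip today"
example_spans = fuzzy_spans(example_text, ["kickflip", "tre flip"], "TRICK")
for span in example_spans:
    print(example_text[span["start"]:span["end"]], span)
```

With this threshold, the misspelled "kikflip" is still picked up by the "kickflip" pattern, while an unrelated token such as "flip" on its own is not. A dedicated matcher like the one in spaczz offers far more control over scoring and boundaries than this sketch.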
To try this out, run the following command from the prodigy-recipes repo:
prodigy ner.fuzzy.manual tricks blank:en example-datasets/skate_reddit.jsonl example-patterns/patterns_skateboarding_tricks-TRICK.jsonl --label TRICK -F ner/ner_fuzzy_manual.py
This will bring up the Prodigy NER manual UI with the trick names pre-highlighted, including the "misspelled" ones!
Please note that the spaczz FuzzyMatcher used in this recipe does not resolve multiple overlapping matches. To support overlapping matches, please check out the spaczz SpaczzRuler.
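If you would rather filter overlapping matches yourself before sending spans to the UI, one common approach is to keep the longest match and drop anything that overlaps it, which is the same idea as spaCy's filter_spans utility. The sketch below is an assumption about how you might do this on plain span dicts. It is not how SpaczzRuler works internally, and the helper name drop_overlaps and the example offsets are hypothetical.

```python
def drop_overlaps(spans):
    """Keep the longest spans and discard any span that overlaps a kept one."""
    # Longest first; for equal lengths, the span that starts earlier wins.
    by_length = sorted(
        spans,
        key=lambda s: (s["end"] - s["start"], -s["start"]),
        reverse=True,
    )
    kept = []
    for span in by_length:
        # A span is accepted only if it lies entirely before or after
        # every span that has already been kept.
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


# Hypothetical matches on the text "varial kickflip then ollie":
overlapping = [
    {"start": 7, "end": 15, "label": "TRICK"},   # "kickflip"
    {"start": 0, "end": 15, "label": "TRICK"},   # "varial kickflip"
    {"start": 21, "end": 26, "label": "TRICK"},  # "ollie"
]
resolved = drop_overlaps(overlapping)
print(resolved)
```

Here the shorter "kickflip" match is dropped because it sits inside "varial kickflip", and the non-overlapping "ollie" match is kept.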
We highly recommend checking the spaczz documentation for more options to optimize your pattern matching further.
Links:
- spaczz documentation: GitHub - gandersen101/spaczz: Fuzzy matching and more functionality for spaCy.
- ner.fuzzy.manual example recipe: prodigy-recipes/ner_fuzzy_manual.py at master · explosion/prodigy-recipes · GitHub
- example patterns with skateboarding tricks: prodigy-recipes/patterns_skateboarding_tricks-TRICK.jsonl at master · explosion/prodigy-recipes · GitHub
- skateboarding subreddit corpus: prodigy-recipes/skate_reddit.jsonl at master · explosion/prodigy-recipes · GitHub