US address detection using Prodigy

ryanwesslen · July 5, 2022, 11:14pm

Thanks for your question! This is very similar to a recent spaCy GitHub issue.

A few initial questions:

What is your data? Are the addresses always complete but inconsistent (e.g., missing abbreviations, or some using 9-digit ZIP codes while others use 5-digit ones)? Or are they sometimes complete and other times incomplete (e.g., in manually written notes)?
Do you have any ground truth? For your second question, it's very hard to know whether an address is correct if you have neither known rules for the addresses (e.g., USPS rules for addresses) nor ground-truth examples of what they should look like. As Peter recommended in the GitHub issue, without either of these you may be better off finding an address validator API.
Your example confuses me because even the output doesn't complete the word STREET (it shows STREE instead). Was the only difference adding the last name?
Can you elaborate more on your use case? What is success in your opinion? Have you considered any benchmarks like existing solutions?
What if an address has an unintentional error, e.g., someone accidentally put St. instead of Ave.? That isn't technically fake or invalid. How would you expect to handle these cases?

Here are a few resources that could help:

NER/statistical model for building an address parser. This seems to be aligned with what you're thinking.
spaCy issues on How to train the NER to recognize addresses. This is an NER approach, which you seem not to want, but I think there's some insight you can still gain.
Rule-based/matcher using spaCy
mordecai, which is a text geoparser that uses spaCy
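
To make the rule-based idea concrete, here is a minimal sketch that uses only Python's standard `re` module as a stand-in for spaCy's matcher. The `STREET_TYPES` list, the function names, and the patterns are all assumptions for illustration; they cover only a tiny slice of real US address formats and don't validate anything.

```python
import re

# Hypothetical, deliberately short list of street-type suffixes.
# Longer forms come first so "Street" is tried before "St".
STREET_TYPES = [
    "Street", "St", "Avenue", "Ave", "Boulevard", "Blvd",
    "Road", "Rd", "Drive", "Dr", "Lane", "Ln",
]

# House number, then one to three capitalized words, then a street type,
# optionally followed by a period (e.g. "St.").
STREET_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][a-z]+\s+){1,3}?(?:" + "|".join(STREET_TYPES) + r")\b\.?"
)

# 5-digit ZIP code, optionally with the 4-digit extension (ZIP+4).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def find_street_lines(text):
    """Return substrings that look like 'number + street name + street type'."""
    return [m.group(0) for m in STREET_RE.finditer(text)]


def find_zip_codes(text):
    """Return substrings that look like 5-digit or ZIP+4 codes."""
    return [m.group(0) for m in ZIP_RE.finditer(text)]


text = "Ship to 123 Main St., Springfield, IL 62704-1234 by Friday."
print(find_street_lines(text))
print(find_zip_codes(text))
```

Patterns like these only tell you that something looks like an address. They can't tell you whether it's a real or correct one, which is where ground truth or a validator API comes in.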

Your question seems to focus more on spaCy rather than Prodigy. Prodigy is best used when thinking about how you would elicit annotations (feedback) from users. If you haven't seen it, I would recommend watching @koaning's excellent data deduplication video:

He shows you how to create custom recipes to identify duplicated individual users (e.g., addresses + personal info). While it is a slightly different task, it's a great way to think of framing user annotation tasks as an A/B problem. You can find the accompanying code in this GitHub repo.
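
As a rough illustration of that framing (not the code from the video or the repo), here is a small standard-library sketch that turns a list of address strings into candidate pairs for a human "same or different?" decision. The function name, the `threshold` value, and the dictionary keys are hypothetical and are not Prodigy's task format.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(records, threshold=0.8):
    """Yield pairs of records similar enough to be worth a human check."""
    for a, b in combinations(records, 2):
        # Character-level similarity in [0, 1], ignoring case.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            # Keys "a", "b" and "score" are made up for this sketch.
            yield {"a": a, "b": b, "score": round(score, 2)}


records = [
    "123 Main St, Springfield IL",
    "123 Main Street, Springfield IL",
    "9 Elm Ave, Portland OR",
]

for pair in candidate_pairs(records):
    print(pair["a"], "<->", pair["b"])
```

Each yielded pair could then be shown to an annotator as a single accept/reject question, which is usually much faster to answer than labeling every address from scratch.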

Hope this helps and let us know if you have other questions.

