US address detection using Prodigy

Hi @KSK!

Thanks for your question! This is very similar to a recent spaCy GitHub issue.

A few initial questions:

  1. What does your data look like? Are the addresses always complete but formatted inconsistently (e.g., missing abbreviations, or some use 9-digit ZIP codes while others use 5-digit ones)? Or are the addresses sometimes complete and other times incomplete (e.g., when they appear in manually written notes)?
  2. Do you have any ground truth? For your second question, it's very hard to know whether an address is correct if you have neither known rules for the addresses (e.g., USPS addressing rules) nor ground truth examples of what they should look like. As Peter recommended in the GitHub issue, without either of these you may be better off finding an address validator API.
  3. Your example confuses me because even the output doesn't complete the word STREET (it shows STREE instead). Was the only difference between the two inputs the added last name?
  4. Can you elaborate on your use case? What would success look like to you? Have you considered any benchmarks, such as existing solutions?
  5. What if an address has an unintentional error, e.g., someone accidentally wrote St. instead of Ave.? That address isn't technically fake, and it may not even be invalid. How would you expect to handle these cases?
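To make question 1 a bit more concrete, here's a minimal sketch in plain Python (no spaCy or Prodigy involved) that reports which ZIP format an address string ends with. The function name `zip_format` and the sample addresses are made up for illustration, and I'm assuming the ZIP code is the last token in the string. Note that it only checks the format: it can't tell you whether the ZIP or the address actually exists, which is why ground truth or a validator API still matters.

```python
import re

# Matches a 5-digit ZIP or a hyphenated ZIP+4 (12345 or 12345-6789)
# at the very end of the string. Assumption: the ZIP is the last token.
ZIP_RE = re.compile(r"\b(\d{5})(?:-(\d{4}))?\s*$")


def zip_format(address: str) -> str:
    """Return 'zip5', 'zip9', or 'missing' based only on the trailing ZIP format."""
    match = ZIP_RE.search(address)
    if match is None:
        return "missing"
    # Group 2 is only set when the "-1234" extension is present.
    return "zip9" if match.group(2) else "zip5"


if __name__ == "__main__":
    # Hypothetical examples, not taken from your data.
    samples = [
        "123 MAIN STREET, SPRINGFIELD, IL 62701",
        "123 MAIN STREET, SPRINGFIELD, IL 62701-1234",
        "123 MAIN STREET, SPRINGFIELD",
    ]
    for sample in samples:
        print(zip_format(sample), "|", sample)
```

Running something like this over your data would quickly tell you how mixed the formats are before you decide whether you need a model at all.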

Here are a few resources that could help:

Your question seems to focus more on spaCy than on Prodigy. Prodigy is most useful when you're thinking about how to elicit annotations (feedback) from users. If you haven't seen it, I would recommend watching @koaning's excellent data deduplication video:

He'll show you how to create custom recipes to identify duplicated individual users (e.g., addresses + personal info). While it is a slightly different task, it's a great way to think about framing user annotation tasks as an A/B problem. You can find the accompanying code in this GitHub repo.
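If it helps to see the A/B framing in code, here's a rough sketch of how you might generate candidate pairs for someone to review. This is not the code from the video or the repo: it uses only the standard library (`difflib`), the function name `candidate_pairs`, the dict keys, and the 0.8 threshold are all my own assumptions, and the addresses are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(addresses, threshold=0.8):
    """Return pairs of addresses that look similar enough to ask a human about.

    Each pair is a plain dict with hypothetical keys "a", "b" and "score".
    """
    pairs = []
    for a, b in combinations(addresses, 2):
        # Character-level similarity in [0, 1], compared case-insensitively.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append({"a": a, "b": b, "score": round(score, 3)})
    # Show the most similar pairs first.
    return sorted(pairs, key=lambda pair: pair["score"], reverse=True)


if __name__ == "__main__":
    # Made-up addresses for illustration only.
    addresses = [
        "123 Main St, Springfield, IL 62701",
        "123 Main Street, Springfield, IL 62701",
        "9 Elm Ave, Portland, OR 97201",
    ]
    for pair in candidate_pairs(addresses):
        print(pair)
```

Each resulting pair is then a simple yes/no question for an annotator ("are these the same address?"), which is the kind of task Prodigy is good at. Comparing every pair is quadratic, so for a large dataset you'd want to group addresses first (e.g., by ZIP code).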

Hope this helps and let us know if you have other questions.
