US address detection using Prodigy

I am new to Prodigy. I would like to know:

  1. How do I detect addresses using NER and spaCy? NER detects names, organizations, dates, book titles, etc. While there are a lot of pre-trained models for names, there are none for addresses.
    In the same way, is it possible to detect US addresses (street name, city, zip code, country) using Prodigy? I consider an address to be the whole span (street, city, zip code, country). I am not interested in extracting the street, city, zip code, or country as separate entities.
  2. Once something has been detected as an address, how can we tell whether it is real or fake (i.e., this is an address or this is not an address)?
    My name is Lalitha, I live in 2574 EAST 23RD STREE, CHATTANOOGA, TN 37404, United States
    My name is Lalitha Name, I live in 2574 EAST 23RD STREE, CHATTANOOGA, TN 37404, United States ADDRESS
    Thank you

Hi @KSK!

Thanks for your question! This is very similar to a recent spaCy GitHub issue.

A few initial questions:

  1. What is your data? Are the addresses always complete but inconsistently formatted (e.g., missing abbreviations, or some using 9-digit zip codes while others use 5-digit ones)? Or are the addresses sometimes complete and other times incomplete (e.g., sometimes in manually written notes)?
  2. Do you have any ground truth? For your second question, it's very hard to know whether an address is correct if you have neither known rules for the addresses (e.g., USPS rules for addresses) nor ground truth examples of what they should look like. As Peter recommended in the GitHub issue, without either of these you may be better off finding an address validator API.
  3. Your example confuses me because even the output doesn't complete the word STREET (it has STREE instead). Was the only difference in adding the last name?
  4. Can you elaborate more on your use case? What is success in your opinion? Have you considered any benchmarks like existing solutions?
  5. What if an address has an unintentional error, e.g., someone accidentally put St. instead of Ave.? That's not technically fake or invalid. How would you expect to handle these cases?
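
To make the "known rules" idea from the questions above a bit more concrete, here is a minimal sketch of a rule-based baseline that pulls out a whole address as one span. The pattern and the `find_address_candidates` helper are my own assumptions for illustration, not something spaCy or Prodigy ships: it assumes a comma-separated layout (house number and street, city, two-letter state, 5- or 9-digit zip, optional country) and will miss anything that deviates from that. It only tells you that a span looks like an address, not that the address exists.

```python
import re

# Hypothetical pattern for illustration only. It assumes the layout
# "<number> <street>, <city>, <ST> <zip>[, <country>]".
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+"                        # house number
    r"[A-Za-z0-9 .'-]+?,\s*"               # street text up to the first comma
    r"[A-Za-z .'-]+?,\s*"                  # city up to the next comma
    r"[A-Z]{2}\s+"                         # two-letter state abbreviation
    r"\d{5}(?:-\d{4})?"                    # 5-digit zip or zip+4
    r"(?:,\s*(?:United States|USA|US))?"   # optional country
)


def find_address_candidates(text):
    """Return (start, end, matched_text) for each address-like span in text."""
    return [(m.start(), m.end(), m.group(0)) for m in ADDRESS_RE.finditer(text)]


text = (
    "My name is Lalitha, I live in 2574 EAST 23RD STREE, "
    "CHATTANOOGA, TN 37404, United States"
)
for start, end, match in find_address_candidates(text):
    print(start, end, match)
```

Note that the misspelled STREE still matches, which is exactly the "looks like an address but may not be valid" situation from question 5. Character offsets like these can also serve as candidate spans to review during annotation.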

Here are a few resources that could help:

Your question seems to focus more on spaCy than on Prodigy. Prodigy is best used when thinking about how you would elicit annotations (feedback) from users. If you haven't seen it, I would recommend watching @koaning's excellent data deduplication video.

He shows how to create custom recipes to identify duplicated individual users (e.g., addresses + personal info). While it is a slightly different task, it's a great way to think about framing user annotation tasks as an A/B problem. You can find the accompanying code in this GitHub repo.
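
As a rough sketch of what that framing could look like for addresses: each task shows the text with one highlighted candidate span, and the annotator only has to accept or reject it. The snippet below writes such tasks as JSONL using the `text` / `spans` (`start`, `end`, `label`) layout that Prodigy reads for NER-style data. The `make_binary_tasks` helper and the `ADDRESS` label are assumptions for illustration, and where the candidate offsets come from (rules, a model, a validator) is up to you.

```python
import json


def make_binary_tasks(text, candidates, label="ADDRESS"):
    """Build one task per candidate span so each can be accepted or rejected.

    `candidates` is a list of (start, end) character offsets into `text`.
    """
    tasks = []
    for start, end in candidates:
        tasks.append({
            "text": text,
            "spans": [{"start": start, "end": end, "label": label}],
        })
    return tasks


text = "I live in 2574 EAST 23RD STREE, CHATTANOOGA, TN 37404, United States"
address = "2574 EAST 23RD STREE, CHATTANOOGA, TN 37404, United States"

# The offsets are computed from a known substring here purely for the demo.
start = text.index(address)
tasks = make_binary_tasks(text, [(start, start + len(address))])

# One JSON object per line, i.e. JSONL.
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Collecting accept/reject decisions this way also gives you the ground truth examples mentioned above, which you would need before training or evaluating anything.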

Hope this helps and let us know if you have other questions.
