"There are addresses everywhere.
1109 Ninth 85007
Smarty can find them.
3785 Las Vegs Av.
Los Vegas, Nevada
That is all."
It should be able to find the first line of the address and its separate entities: city, state, zipcode, and country. The addresses do not need to be validated, though.
Thanks for your question and welcome to the Prodigy community!
That's a good question. In general, I would err more towards ner than spancat for address extraction. First, addresses tend to have well-defined boundaries; that is, it's pretty clear what is or isn't part of the address. The main challenge is deciding whether you want to break the address down into its separate entities (city, state, zipcode, and so on) or label the whole thing as one entity. Secondly, addresses don't typically need overlapping entities, and support for overlapping spans is one of the main benefits of spancat.
However, it would be interesting to train with spancat too.
I would strongly recommend using the ner recipes like ner.manual and ner.correct, simply because annotations made with the ner recipes can be used to train either ner or spancat, whereas annotations from the spans recipes can't necessarily train ner, only spancat. And since you don't expect any overlapping entities (right?), the ner recipes won't cost you anything.
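For example, a ner.manual run for this task could look something like the command below; the dataset name, input file, and labels are just placeholders, so swap in whatever granularity of labels you settle on:
python3 -m prodigy ner.manual addresses blank:en ./texts.jsonl --label STREET,CITY,STATE,ZIPCODE,COUNTRY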
But where I think you'll get the best performance is some combination of ML + rules. I previously wrote a summary of spaCy and Prodigy resources on address extraction:
I really like this post, which goes through using spaCy to train a ner component and then adds rules near the end. Prodigy can come in for the annotation step.
Since that post goes into managing a spaCy config file, I'd encourage you to use data-to-spacy and train your model with spacy train rather than prodigy train, which is simply a wrapper around spacy train. It will require a little knowledge of spaCy and config files, but hopefully you can mostly copy and paste. Be sure to use spacy debug config or spacy debug data to help make sure your data and config files are correct.
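Roughly, and assuming your ner annotations live in a Prodigy dataset called addresses (swap in your own dataset name and paths), that workflow would look something like:
python3 -m prodigy data-to-spacy ./corpus --ner addresses
python3 -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
python3 -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --output ./output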
Also - thinking outside the box - I think this "messy" address extraction would be an interesting task for zero/few-shot learning with LLMs. We just integrated LLM components into our new v1.12 alpha, which you would need to install.
Granted, you'd need access to (and pay for) OpenAI's API, but this could speed up annotation in a few ways, such as creating synthetic data. For example, if you run:
python3 -m prodigy terms.openai.fetch "US addresses embedded in text" ./output/addresses.jsonl
It would generate synthetic data like this:
{"text":"On vacation I visited my grandmother who lives at 995 Lexington Avenue, Aurora, IL 60502","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"The office is located at 425 Main Street, San Antonio, TX 78205.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"3. I went to visit my cousin in 1112 1st Avenue, San Francisco, CA.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"I visited the Empire State Building at 350 Fifth Avenue in New York City this summer.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"My best friend lives at 23 Mount Street, Memphis, TN 38119","meta":{"openai_query":"US addresses embedded in text"}}
What's nice is I bet you could do some prompt engineering to generate rarer types of addresses, like addresses with apt/unit numbers, or even ones with misspellings. I'm sort of hacking the terms.openai.fetch recipe here, as it's intended for terms/entities rather than short sentences; but from the test above it looks great.
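For instance, you could simply vary the query string; something along these lines (the exact wording is only a guess at a prompt that might work, and the output path is a placeholder):
python3 -m prodigy terms.openai.fetch "US addresses with apartment or unit numbers embedded in text, some with misspellings" ./output/messy_addresses.jsonl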
Alternatively, you could try the zero/few-shot training recipes too.
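If I recall the recipe correctly, a command like the one below would have the model pre-annotate addresses for you to correct in the UI; do double-check the v1.12 docs for the exact arguments, and treat the dataset name, input file, and labels as placeholders:
python3 -m prodigy ner.openai.correct addresses ./texts.jsonl --label STREET,CITY,STATE,ZIPCODE,COUNTRY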
Again, completely ignore the OpenAI/LLM approach if you want; but I think address extraction with LLMs could yield some efficiency gains in your workflow.