Address extraction: NER or Spancat?

Hi, I am looking into creating something similar to Smarty's US Extract API,
which can detect potential addresses in documents:

		"There are addresses everywhere.
		1109 Ninth 85007
		Smarty can find them.
		3785 Las Vegs Av.
		Los Vegas, Nevada
		That is all."

It should be able to find the first line of the address and its separate entities: city, state, zip code, country. The addresses do not need to be validated, though.

I have tried a rules-based approach with regex and datamade/usaddress (a Python library for parsing unstructured United States address strings into address components), but it feels like there are just too many edge cases. Potential addresses may have misspellings, entities may be out of order, and entities like the state may be missing:

 "some random next in front APT #11 1234 Somestreet ST CITY_FIELD Las Angelos, 90001 more random next in back"

I was looking into spaCy and found many tutorials for NER, plus more recent mentions of spancat.

Any advice on which rabbit hole to go down would be appreciated. It sounds like spancat is the one, but I have not done NLP before, so I'm not sure. Thank you.

Hi @azidahaka,

Thanks for your question and welcome to the Prodigy community :wave:

That's a good question. In general, I would err more towards NER than spancat for address extraction. First, addresses tend to have well-defined boundaries: it's usually pretty clear what is or isn't part of the address. The challenge is deciding whether you want to break the address down into its components (street, city, state, zip code) or label it as one entity. Second, addresses tend not to need overlapping entities, which are one of the main benefits of spancat.

However, it could still be interesting to train with spancat too and compare.

Either way, I would strongly recommend annotating with the ner recipes like ner.manual and ner.correct, simply because you can use annotations from ner to train either ner or spancat; you can't necessarily use the spans recipes' annotations to train ner, only spancat. And since you don't expect any overlapping entities (right?), the ner recipes keep both options open.
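
For example, a first pass at annotating could look like this (the dataset name, source file, and label set are just placeholders to adapt):

    python3 -m prodigy ner.manual address_ner blank:en ./documents.jsonl --label ADDRESS,STREET,CITY,STATE,ZIP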

But where I think you'll get the best performance is with some combination of ML + rules. I previously wrote a summary of spaCy and Prodigy resources on address extraction:

I really like this post, which goes through using spaCy to train a ner component and then adds rules near the end. Prodigy would come in for the annotation step.
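
To give a flavor of the ML + rules combination, here's a rough sketch (the model path and labels are assumptions, not from the post): after your trained ner component, an entity_ruler can catch high-precision patterns, like zip codes, that the model misses:

    import spacy

    # load your trained pipeline (placeholder path)
    nlp = spacy.load("./output/model-best")

    # add a rule-based pass after ner; overwrite_ents lets the rules win on conflicts
    ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
    ruler.add_patterns([
        # 5-digit zip codes, optionally with a +4 extension
        {"label": "ZIP", "pattern": [{"TEXT": {"REGEX": r"^\d{5}(-\d{4})?$"}}]},
    ])

    doc = nlp("3785 Las Vegs Av. Los Vegas, Nevada 89109")
    print([(ent.text, ent.label_) for ent in doc.ents])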

Since that post goes into managing a spaCy config file, I'd encourage you to use data-to-spacy and train your model with spacy train rather than prodigy train (which is simply a wrapper for spacy train). It will require a little knowledge of spaCy and config files, but hopefully you can copy and paste. Be sure to use spacy debug config or spacy debug data to help make sure your data and config files are correct.
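
Roughly, that workflow would look like this (dataset name and paths are placeholders):

    # export your Prodigy annotations to spaCy's binary format plus a config
    python3 -m prodigy data-to-spacy ./corpus --ner address_ner --eval-split 0.2

    # sanity-check the generated config and data
    python3 -m spacy debug data ./corpus/config.cfg

    # train with spaCy directly
    python3 -m spacy train ./corpus/config.cfg --output ./output --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy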

Also - thinking outside the box - I think this "messy" address extraction would be an interesting task for zero/few-shot learning with LLMs. We just integrated LLM components into our new v1.12 alpha, which you would need to install.

Granted, you'd need access to (and to pay for) OpenAI's API, but this could speed up annotation in a few ways, like creating synthetic data. For example, if you run:

    python3 -m prodigy terms.openai.fetch "US addresses embedded in text" ./output/addresses.jsonl

It would generate synthetic data like this:

{"text":"On vacation I visited my grandmother who lives at 995 Lexington Avenue, Aurora, IL 60502","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"The office is located at 425 Main Street, San Antonio, TX 78205.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"3. I went to visit my cousin in 1112 1st Avenue, San Francisco, CA.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"I visited the Empire State Building at 350 Fifth Avenue in New York City this summer.","meta":{"openai_query":"US addresses embedded in text"}}
{"text":"My best friend lives at 23 Mount Street, Memphis, TN 38119","meta":{"openai_query":"US addresses embedded in text"}}

What's nice is that I bet you could do some prompt engineering to generate rarer types of addresses, like addresses with apt/unit numbers, or even deliberate misspellings. I'm sort of hacking the terms.openai.fetch recipe here, as it's intended for terms/entities rather than short sentences; but from the test above, it looks great.
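
For instance, you could simply make the query more specific (still bending the recipe's intended use):

    python3 -m prodigy terms.openai.fetch "misspelled US addresses with apartment or unit numbers embedded in text" ./output/messy_addresses.jsonl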

Alternatively, you could try the zero/few-shot NER annotation recipes too.
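
For example, with ner.openai.correct the LLM suggests entities that you then accept or fix in the UI (the dataset, source file, and labels below are placeholders):

    python3 -m prodigy ner.openai.correct address_ner ./documents.jsonl --label ADDRESS,CITY,STATE,ZIP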

Again, feel free to completely ignore the OpenAI/LLM approach if you want, but I think address extraction with LLMs could yield some efficiency gains in your workflow.

Hope this helps!
