Address entity recognition from a resume/CV

I want to parse freeform street/postal addresses out of resumes/CVs. Addresses are not regular, so regular expressions are out. Therefore I want to treat this as an entity recognition problem and annotate resumes accordingly.

I have the following concerns.

Concern #1: Ines mentioned that NER works best for clearly defined categories of “things” that occur in similar contexts.
In my opinion, an address is a clearly defined category of things. (For example, it is not an ambiguous category like “CRIME VICTIM”; it is easy to provide a definition of an address.)

I am not sure that addresses occur in similar contexts within resumes. Of course, addresses are always part of the “Personal information” section. But it seems quite hard for an NER model to recognize similarities in the contexts shown in Appendix A.

Question #1: Does Concern #1 make sense? Is the absence of good context a blocker for treating the problem as an entity recognition problem?

Concern #2: The number of tokens in an address is not constant; it varies to a high degree.

Concern #3: GPE seems to be the closest existing entity type to an Address entity, but the degree of overlap is quite low. This means that training a new entity type is required.

Question #2: Am I right that training a new entity type is required?

Question #3 (general question): Let us suppose that we have a sufficient amount of training data. Please share your thoughts on address entity recognition. Will it work in general?

Appendix A
Bob Brown
4-3-9-2267 Tsukuda, Chuo-ku, Tokyo, Japan 123-4567 
Phone: 123-4567-8912

12 Sycamore Circle • Stony Brook, NY 12345 • (123) 456-7891 Home • (123) 456-7891 Cell •

3333 NE 222TH STREET • VANCOUVER, WA 99999
(503) 515-6223 •

4)      Alice S. Wick, Jr.
888 Everhill							99 Vickie Drive
Peachtree City, Georgia 44444				Napanoch, New York 22222
Phone:(333) 777-9999                                                         (222) 444-5555

Addresses are definitely common NER categories and I agree with your analysis that it’s a clearly defined category. You just want to make sure that you’re clear about what you are considering an address – is it the whole thing (street, city, post code, country), or just the street? How does your scheme handle things like “c/o XY”? And so on.

One thing to consider when talking about local context: it includes the surrounding words, but also the entity tokens themselves. So if your addresses follow some kind of pattern (“X Y Street”, “5 X Y” etc.), they’d be easier to recognize. In your case, that might be a bit tricky, because you seem to be dealing with a variety of international addresses, including anglicised versions (e.g. Japanese) that aren’t always consistent.

GPE stands for “geopolitical entity”, meaning everything with a government / governing body. So Japan and Tokyo would be considered a GPE, while “the Bay Area” wouldn’t (because it’s just an area).

If you’re working with a pre-trained model that uses an annotation scheme like this and suddenly try to teach it a very different interpretation of GPE, this can easily lead to a lot of problems. To override and adjust the existing weights to fit your definition you’d easily need as many examples as the original training corpus – and in that case, it’d make much more sense to just train from scratch.

If you’re adding new entity types, you definitely want to avoid redefining existing labels in the scheme, or introducing overlapping types. But one thing you could try is to use the existing categories that apply (GPE, PERSON), add more types for the individual address parts (like STREET_NAME) – and then use rules to put them together to form a full address.
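As a rough illustration of that last idea, here’s a minimal, library-free sketch (the function name `merge_address_parts` and the `(start_char, end_char, label)` tuple format are just assumptions for illustration, not any particular library’s API): it joins runs of adjacent address-part spans into a single ADDRESS span.

```python
# Hypothetical sketch: combine component entities (STREET_NAME, GPE, POSTCODE)
# predicted by a model into one ADDRESS span. Spans are (start_char, end_char,
# label) tuples; two components separated by at most `max_gap` characters
# (enough to cover ", " and similar separators) are joined.

ADDRESS_PARTS = {"STREET_NAME", "GPE", "POSTCODE"}

def merge_address_parts(spans, max_gap=3):
    """Merge runs of adjacent address-part spans into ADDRESS spans."""
    merged = []
    for start, end, label in sorted(spans):
        if label not in ADDRESS_PARTS:
            merged.append((start, end, label))
        elif (merged and merged[-1][2] == "ADDRESS"
                and start - merged[-1][1] <= max_gap):
            # Extend the ADDRESS span we're currently building.
            merged[-1] = (merged[-1][0], end, "ADDRESS")
        else:
            merged.append((start, end, "ADDRESS"))
    return merged
```

For “12 Sycamore Circle, Stony Brook, NY 12345”, spans for the street, the GPEs and the post code would collapse into one ADDRESS span, while an unrelated PERSON span elsewhere in the text stays untouched.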

It’d likely work better for extracting addresses from natural language text, rather than from fairly isolated blocks that only contain personal information. For example, if you’re analysing cover letters and you want to detect semi-vague mentions of locations – like “I worked for Google [ORG] in Zurich [GPE]” vs. “I worked for Zurich [ORG] in Berlin [GPE]”.

It’s possible that a clever, rule-based approach will outperform any statistical model for your use case. There are 195 countries and apparently around 100,000 (reasonably-sized?) cities in the world. Even if it were 10 times as many, matching those in your data really isn’t much of a challenge anymore for a machine. Combinations of those with numbers are really easy to detect, too, even with simple regular expressions. Once you’ve identified those, you can use another set of rules to put it all together. It’s not as sexy as a neural network model, but in the end, what you care about is the results, right?
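A toy sketch of that rule-based idea, assuming a tiny inline gazetteer and two simple patterns (a real version would load the full country/city lists, and the function names here are made up for illustration):

```python
import re

# Tiny stand-in gazetteer; a real one would hold ~195 countries and
# ~100,000 city names loaded from a data file.
GAZETTEER = {"Japan", "Tokyo", "Stony Brook", "NY", "Vancouver", "WA"}

# "number + capitalised words" street-ish pattern, e.g. "12 Sycamore Circle".
STREET_RE = re.compile(r"\b\d{1,5}(?:\s+[A-Z][A-Za-z]*){1,4}")
# US-style ZIP codes, e.g. "12345" or "12345-6789".
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def find_address_cues(line):
    """Collect regex matches and gazetteer hits found on one line."""
    cues = [m.group() for m in STREET_RE.finditer(line)]
    cues += [m.group() for m in ZIP_RE.finditer(line)]
    cues += [place for place in GAZETTEER if place in line]
    return cues

def looks_like_address(line):
    # Crude combination rule: two or more independent cues on a line
    # make it a candidate (part of an) address.
    return len(find_address_cues(line)) >= 2
```

On the examples from Appendix A, a line like “12 Sycamore Circle • Stony Brook, NY 12345” fires on the street pattern, the ZIP pattern and two gazetteer entries, while a phone-number line produces no cues at all.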

To try this and find out what works best, you could start by annotating a few hundred representative examples manually to create your evaluation set. Then label some training data, train your model and evaluate it on that set. Then write some rules and run the same evaluation.
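The “same evaluation” part can be as simple as exact span matching against the gold annotations. A minimal sketch (the helper name `span_prf` and the `(start, end)` tuple format are assumptions, not any library’s API):

```python
# Hypothetical sketch of the evaluation step: score predicted ADDRESS spans
# against a manually annotated gold set with exact-match precision/recall/F1.

def span_prf(gold, predicted):
    """Precision, recall and F1 over exact (start, end) span matches."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that match the annotation exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Running both the trained model’s output and the rule-based output through the same function gives you directly comparable numbers.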

I am considering an address as the whole thing (street, city, post code, country). So I am currently not interested in extracting more granular entities like street, city, post code or country.

I decided to initially focus on extracting addresses typically located at the beginning of a resume (the “Personal information” section). My understanding is that such addresses do not contain things like “c/o XY”, so it seems I will not face such cases.

Things like “c/o XY” are part of postal addresses.

One example of a postal address from a real resume (References section)

Mr. Thomas (Tom) Wipple - (CEO - Vista Energy International)
c/o Vista Energy International
Office: 1-360--* - ext: 502
Mobile: 1-425--*
Email Address: ac****

Yes, the “Address”-like interpretation of GPE is quite different from the initial interpretation.

Thank you for the explanation.

I also suppose that the information provided by the context should be very helpful for the named entity recognition process.

Yes, I will need to prepare a large gazetteer to cover countries, cities and possibly some other categories.

Thank you for providing your thoughts on that. I need to try it.

Yes, you are right.

Yes, that makes sense. Thank you very much for all of the above thoughts.