I want to parse freeform street/postal addresses out of resumes/CVs. Addresses are not regular, so regular expressions are out. I therefore want to treat this as an entity recognition problem and annotate resumes.
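To illustrate the point about regular expressions, here is a minimal sketch. The pattern `US_ADDRESS` is a hypothetical one written only for this illustration; it targets a "number, street, city, two-letter state, ZIP" layout and is applied to two of the samples from the end of this post.

```python
import re

# Hypothetical pattern (an assumption for illustration only):
# "<number> <street> [• or ,] <City>, <ST> <5-digit ZIP>"
US_ADDRESS = re.compile(
    r"\d+\s+[A-Za-z0-9 ]+?\s*[•,]\s*[A-Za-z ]+,\s*[A-Z]{2}\s+\d{5}"
)

samples = [
    "JACKIE CHAN 12 Sycamore Circle • Stony Brook, NY 12345 • (123) 456-7891 Home",
    "Bob Brown 4-3-9-2267 Tsukuda, Chuo-ku, Tokyo, Japan 123-4567 Phone: 123-4567-8912",
]

for text in samples:
    match = US_ADDRESS.search(text)
    # Print the matched address, or None when the pattern does not apply.
    print(match.group(0) if match else None)
```

The pattern finds the US-style address in the first sample but returns nothing for the Japanese one, and each new layout would need yet another pattern.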
I have the following concerns.
Concern #1: Ines mentioned that NER works best for clearly defined categories of “things” that occur in similar contexts.
In my opinion, an address is a clearly defined category of things. (For example, it is not an ambiguous category like “CRIME VICTIM”. It is easy to provide a definition of an address.)
However, I am not sure that addresses occur in similar contexts within resumes. Addresses are always part of the “Personal information” section, but it seems quite hard for an NER model to recognize similarities between the contexts shown in Appendix A.
Question #1: Does Concern #1 make sense? Is the absence of good context a blocker for treating this as an entity recognition problem?
Concern #2: The number of tokens in an address is not constant. It varies widely from one address to another.
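A rough sketch of how much the token count can vary, using three address spans taken from the samples at the end of this post. Two things here are assumptions: where each address span begins and ends, and the simple regex tokenizer (runs of word characters, or single punctuation marks), which will not split text exactly the way spaCy's tokenizer does.

```python
import re

# Address spans cut by hand from the sample resumes (span boundaries are an assumption).
addresses = [
    "4-3-9-2267 Tsukuda, Chuo-ku, Tokyo, Japan 123-4567",
    "12 Sycamore Circle • Stony Brook, NY 12345",
    "3333 NE 222TH STREET • VANCOUVER, WA 99999",
]

def simple_tokenize(text):
    # Stand-in tokenizer: word-character runs or single non-space symbols.
    return re.findall(r"\w+|[^\w\s]", text)

token_counts = [len(simple_tokenize(a)) for a in addresses]

for address, count in zip(addresses, token_counts):
    print(count, address)
```

With this tokenizer the Japanese address comes out at roughly twice as many tokens as the two US addresses.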
Concern #3: GPE seems to be the closest existing entity type to an address, but the two match only loosely. This means that a new entity type has to be trained.
Question #2: Am I right that a new entity type has to be trained?
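For reference, a minimal sketch of what one annotated example for a new entity type might look like. The label name `ADDRESS` and the helper `annotate` are hypothetical, and the record layout (a `text` field plus character-offset `spans`) is assumed to match what the annotation tool expects; it should be checked against the documentation of the version in use.

```python
LABEL = "ADDRESS"  # hypothetical name for the new entity type

def annotate(text, address):
    """Build one annotation record with character offsets for the address span."""
    start = text.find(address)
    if start == -1:
        raise ValueError("address span not found in text")
    end = start + len(address)
    return {"text": text, "spans": [{"start": start, "end": end, "label": LABEL}]}

text = "LAVIS N JOHN 3333 NE 222TH STREET • VANCOUVER, WA 99999 (503) 515-6223"
address = "3333 NE 222TH STREET • VANCOUVER, WA 99999"

example = annotate(text, address)
span = example["spans"][0]
print(span, repr(text[span["start"]:span["end"]]))
```

The whole address is labeled as a single span, so the model has to learn both where the address starts and where it ends.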
Question #3 (general question): Suppose that we have a sufficient amount of training data. What are your thoughts on address entity recognition? Will it work in general?
Appendix A
1) Bob Brown 4-3-9-2267 Tsukuda, Chuo-ku, Tokyo, Japan 123-4567 Phone: 123-4567-8912 E-mail: firstname.lastname@example.org
2) JACKIE CHAN 12 Sycamore Circle • Stony Brook, NY 12345 • (123) 456-7891 Home • (123) 456-7891 Cell • email@example.com
3) LAVIS N JOHN 3333 NE 222TH STREET • VANCOUVER, WA 99999 (503) 515-6223 • firstname.lastname@example.org
4) Alice S. Wick, Jr. 888 Everhill 99 Vickie Drive Peachtree City, Georgia 44444 Napanoch, New York 22222 Phone:(333) 777-9999 (222) 444-5555