Address entity recognition from a resume/CV

Addresses are definitely a common NER category, and I agree with your analysis that it's a clearly defined one. You just want to make sure you're clear about what you're considering an address – is it the whole thing (street, city, post code, country), or just the street? How does your scheme handle things like "c/o XY"? And so on.

One thing to consider when talking about local context: it includes the surrounding words, but also the entity tokens themselves. So if your addresses follow some kind of pattern ("X Y Street", "5 X Y" etc.), that also makes them easier to recognize. In your case, that might be a bit tricky, because you seem to be dealing with a variety of international addresses, including anglicised versions (e.g. of Japanese addresses) that aren't always consistent.
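If a few dominant surface patterns do exist, you can even exploit them directly with token patterns. Here's a minimal sketch assuming a spaCy pipeline (which your GPE/PERSON labels suggest, but your post doesn't confirm) – the pattern, the keyword list and the example sentence are all made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only – enough for token-pattern matching
matcher = Matcher(nlp.vocab)

# Hypothetical "number + capitalised word(s) + street keyword" pattern
pattern = [
    {"LIKE_NUM": True},
    {"IS_TITLE": True, "OP": "+"},
    {"LOWER": {"IN": ["street", "st", "road", "rd", "avenue", "ave"]}},
]
matcher.add("STREET_LIKE", [pattern])

doc = nlp("I lived at 5 Baker Street for two years.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "5 Baker Street"
```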

GPE stands for "geopolitical entity", meaning anything with a government / governing body. So Japan and Tokyo would each be considered a GPE, while "the Bay Area" wouldn't (because it's just an area).

If you're working with a pre-trained model that uses an annotation scheme like this and suddenly try to teach it a very different interpretation of GPE, this can easily lead to a lot of problems. To override and adjust the existing weights to fit your definition you'd easily need as many examples as the original training corpus – and in that case, it'd make much more sense to just train from scratch.

If you're adding new entity types, you definitely want to avoid redefining existing labels in the scheme or creating overlapping types. But one thing you could try is to use the existing categories that apply (GPE, PERSON), add new types for the individual address parts (like STREET_NAME) – and then use rules to put them together to form a full address.
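To give a rough idea of what those glue rules could look like, here's a sketch that assumes a spaCy pipeline whose NER predicts your custom STREET_NAME label alongside GPE – the function name and the `max_gap` threshold are invented for illustration:

```python
from spacy.tokens import Span

def merge_address_spans(doc, max_gap=2):
    """Merge a STREET_NAME entity plus nearby GPE entities into one ADDRESS span."""
    ents = sorted(doc.ents, key=lambda ent: ent.start)
    merged = []
    i = 0
    while i < len(ents):
        ent = ents[i]
        if ent.label_ == "STREET_NAME":
            end = ent.end
            j = i + 1
            # Extend over following GPEs (city, country), allowing a small
            # gap of max_gap tokens for commas, post codes etc.
            while j < len(ents) and ents[j].label_ == "GPE" and ents[j].start - end <= max_gap:
                end = ents[j].end
                j += 1
            if j > i + 1:
                merged.append(Span(doc, ent.start, end, label="ADDRESS"))
                i = j
                continue
        merged.append(ent)
        i += 1
    return merged

# e.g. overwrite the predicted entities with the merged spans:
# doc.ents = merge_address_spans(doc)
```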

A statistical approach like that would likely work better for extracting addresses from natural language text, rather than from fairly isolated blocks that only contain personal information. For example, it'd help if you're analysing cover letters and want to detect semi-vague mentions of locations – like "I worked for Google [ORG] in Zurich [GPE]" vs. "I worked for Zurich [ORG] in Berlin [GPE]".

It's possible that a clever, rule-based approach will outperform any statistical model for your use case. There are 195 countries and apparently around 100,000 (reasonably-sized?) cities in the world. Even if it were 10 times as many, matching those in your data really isn't much of a challenge for a machine. Combinations of those names with numbers are really easy to detect, too, even with simple regular expressions. Once you've identified the parts, you can use another set of rules to put them all together. It's not as sexy as a neural network model, but in the end, what you care about is the results, right?
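For illustration, here's a bare-bones version of that idea in plain Python – the tiny gazetteers and the street / post code regexes are placeholders you'd replace with real country and city lists and the formats you actually see in your data:

```python
import re

COUNTRIES = {"japan", "germany", "united states"}   # ~195 entries in practice
CITIES = {"tokyo", "berlin", "washington"}          # ~100,000 entries in practice

# "number + capitalised words + street keyword", e.g. "1600 Pennsylvania Avenue"
STREET_RE = re.compile(r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s?){1,3}(?:Street|St\.?|Road|Rd\.?|Avenue|Ave\.?)\b")
# e.g. Japanese (NNN-NNNN) and US (NNNNN) post code formats
POSTCODE_RE = re.compile(r"\b\d{3}-\d{4}\b|\b\d{5}\b")

def find_address_parts(text):
    lowered = text.lower()
    return {
        "streets": STREET_RE.findall(text),
        "postcodes": POSTCODE_RE.findall(text),
        "cities": [c for c in CITIES if c in lowered],
        "countries": [c for c in COUNTRIES if c in lowered],
    }

# Prints the street, post code, city and country parts it found
print(find_address_parts("1600 Pennsylvania Avenue, Washington 20500, United States"))
```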

To try this and find out what works best, you could start by annotating a few hundred representative examples manually to create your evaluation set. Then label some training data, train your model and evaluate it on that set. Then write some rules and run the same evaluation.
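Whichever system produces the predictions, the comparison itself can be very simple. A quick sketch of an entity-level evaluation – the (start, end, label) span format and the numbers below are made up:

```python
def evaluate(gold_docs, pred_docs):
    """gold_docs / pred_docs: lists of sets of (start, end, label) spans, one set per document."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # exact matches
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: one document, gold has two addresses, the system found one of them plus a false positive.
gold = [{(10, 42, "ADDRESS"), (80, 120, "ADDRESS")}]
pred = [{(10, 42, "ADDRESS"), (200, 215, "ADDRESS")}]
print(evaluate(gold, pred))  # precision 0.5, recall 0.5, f1 0.5
```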