Best strategies to annotate long entities.

hi @polodealvarado!

Thanks for your questions!

This is really tricky and I can't give a guaranteed solution. Usually the best approach is to try out an idea, see if it works, and iterate from there.

However, my gut feeling would be to break up the entities into components like ZIP, FLOOR, etc., that naturally make up an address. If you choose only one label (e.g., ADD_NAME), I would be concerned that your model may get confused by the mixed signals/patterns that can make up long addresses.

Also, are your addresses US only? Or do they include other countries that may have different structures/rules for addresses? If they were US only, I'd be more inclined to separate them out, as there would be only one convention (e.g., ZIP and STATE are consistent entities in all expected contexts).

With this in mind, I've found a few old posts that discuss addresses:

You may also find these posts to be helpful:

As the posts mention, since addresses typically follow a number of conventions, I would also consider whether and how patterns/rules could help you, either by pre-highlighting spans or by creating your own entity ruler that works with your NER model.
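
To make the pre-highlighting idea concrete, here is a minimal sketch that uses plain regular expressions to suggest ZIP and FLOOR spans before you annotate. The patterns, the `prehighlight` helper and the example address are all made up for illustration, and the output is just a dict with `text` and character-offset `spans`, which is the general shape used for span annotations. Please check the docs for what your recipe expects (e.g., whether spans need to line up with token boundaries).

```python
import re

# Hypothetical patterns for US-style addresses; adjust them to your own data.
# Note: the ZIP pattern also matches any other 5-digit number (e.g., a long
# street number), so treat the matches as suggestions to correct by hand.
PATTERNS = [
    ("ZIP", re.compile(r"\b\d{5}(?:-\d{4})?\b")),
    ("FLOOR", re.compile(r"\b(?:\d+(?:st|nd|rd|th)\s+floor|floor\s+\d+)\b", re.IGNORECASE)),
]


def prehighlight(text):
    """Return a task dict with character-offset spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    # Sort by position so the spans appear in reading order.
    # Overlapping matches are not resolved in this sketch.
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


task = prehighlight("Send it to 350 Fifth Avenue, 21st Floor, New York, NY 10118")
for span in task["spans"]:
    print(span["label"], repr(task["text"][span["start"]:span["end"]]))
```

Rules like these will not catch everything, but they can save clicks on the parts of an address that are very regular, so you can focus your manual annotation on the harder components.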

How would you be using your model? For example, if you know that your final model doesn't need to separate PERSON vs. ORG, then I can see the point in simplifying your entity to "SUBJECT". But if you need your model to separate the two, then you may need to split them out.

I can understand the concern that the model may confuse PERSON vs. ORG based on the context, but you may find that the context does differ and that, with sufficient examples, the model can tease out some of those differences (e.g., you rarely see a person "release next feature of the product" since most of the time it is an organization, not a person, releasing it).
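
One thing that may help with this decision: if you annotate PERSON and ORG separately, you can still collapse them into a single label afterwards, whereas going the other way means re-annotating. Below is a rough sketch of that merge step. `LABEL_MAP`, `merge_labels` and the example are hypothetical names and data I made up, and I'm assuming your annotations are dicts with a `spans` list where each span has a `label`.

```python
# Hypothetical mapping from fine-grained labels to a single coarse label.
LABEL_MAP = {"PERSON": "SUBJECT", "ORG": "SUBJECT"}


def merge_labels(example, label_map=LABEL_MAP):
    """Return a copy of the example with span labels remapped via label_map."""
    merged = dict(example)
    merged["spans"] = [
        # Labels that are not in the mapping are kept unchanged.
        {**span, "label": label_map.get(span["label"], span["label"])}
        for span in example.get("spans", [])
    ]
    return merged


example = {
    "text": "Acme hired Jane Doe.",
    "spans": [
        {"start": 0, "end": 4, "label": "ORG"},
        {"start": 11, "end": 19, "label": "PERSON"},
    ],
}
merged = merge_labels(example)
print([span["label"] for span in merged["spans"]])
```

This way you could train one model on the split labels and one on the merged label from the same annotations and compare which works better for your use case.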

Hope this helps!
