Best strategies to annotate long entities.

Hi!
First of all, thank you for developing such an amazing tool. I am learning a lot with it.

On to my question: which strategies would you recommend for annotating the following cases?

1 - Given long addresses such as "4455 Landing Lange, APT 4, Louisville, KY 40018-1234", would it be better to use several entities (for instance "ZIP", "FLOOR", "NAME_STREET") rather than a single one such as ADD_NAME?

2 - I understand that the label should depend on the context, but in some cases I am unsure:

  • "Antonio will release the next feature of the product"
  • "Ryako S.L will release the next feature of the product".
    In both cases the context is almost the same, but the first is PER and the second is ORG. Should I annotate them as PER and ORG, or should both get the same label (for example "SUBJECT"), with a classifier used later to figure out whether it is PER or ORG?

hi @polodealvarado!

Thanks for your questions!

This is really tricky and I can't give a guaranteed solution -- usually, the best approach is to try out an idea, see if it works, and go from there.

However, my gut preference would be to break the address up into entities like ZIP, FLOOR, etc., which form natural components of the address. If you use only one label (e.g., ADD_NAME), I would be concerned that your model may get confused by the mixed signals/patterns that make up long addresses.

Also, are your addresses US only? Or do they come from other countries that may have different structures/rules for addresses? If they are US only, I'd be more inclined to separate the components out, as there would be only one convention (e.g., ZIP and STATE would be consistent entities in all expected contexts).

With this in mind, I've found a few older posts that discuss addresses:

You may also find these posts to be helpful:

As those posts mention, since addresses typically follow many conventions, I would also consider whether patterns/rules could help you, either by pre-highlighting spans or by creating your own entity ruler that works alongside your NER model.
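As a rough illustration of the pre-highlighting idea, here is a minimal sketch that uses only Python's standard `re` module rather than any particular annotation or NLP library. The `PATTERNS` table and the `pre_highlight` helper are hypothetical names, and the two regular expressions are simplifying assumptions for US-style addresses, so treat them as a starting point rather than a complete rule set.

```python
import re

# Hypothetical rules for US-style addresses; the label names follow this thread.
PATTERNS = {
    # Five digits, optionally followed by a four-digit extension.
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # Two capital letters directly followed by something that looks like a ZIP.
    "STATE": re.compile(r"\b[A-Z]{2}(?=\s+\d{5}\b)"),
}


def pre_highlight(text):
    """Return the text together with rule-based span suggestions."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    # Order spans by their position in the text.
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


example = pre_highlight("4455 Landing Lange, APT 4, Louisville, KY 40018-1234")
for span in example["spans"]:
    print(span["label"], example["text"][span["start"]:span["end"]])
# prints:
# STATE KY
# ZIP 40018-1234
```

Suggestions like these would still need to be reviewed and corrected by hand during annotation. The point is only that the highly regular parts of an address (ZIP, STATE) can often be caught by rules, leaving the model to learn the less regular parts.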

For your second question: how will you be using your model? For example, if you know that your final model doesn't need to separate PERSON vs. ORG, then I can see the point in simplifying your entity to "SUBJECT". But if you need your model to separate the two, then you may need to split them out.

I can understand the concern that the model may confuse PERSON vs. ORG when the context is similar, but you may find that the context does differ, and that with sufficient examples the model can tease out some of those differences (e.g., you rarely see a person "release the next feature of the product", since most of the time it is an organization, not a person, releasing it).

Hope this helps!
