Annotating / training against PERSON and ORG entities

As a newbie, I would like some advice.

What's the best approach to annotating and training on PERSON and ORG entities?
I have a large set of names of persons and organisations mixed together from different countries.
My set includes a name-address field, a name field, and an address field (no surrounding context).

  1. Which is better to use for annotation: the name-address field or the name field only?
  2. Should the form of address and the legal form be included in the annotation?

Any recommendation is much appreciated.

Example:

  • A. Smith
  • Apple
  • Dr. Smith
  • Microsoft
  • Microsoft corporation
  • Mr. and Mrs. Smith
  • Apple Inc.
  • Bins and Sons
  • Michael E. Andrade
  • Rolfson Ltd
  • Mrs Elvira J. Wright
  • Mme Kari Ratte
  • Muse Kharlamova
  • Alexander Larionova
  • Fabryka Samochodow Malolitrazowych
  • Mai Thị Xuan Trang
  • Magnitogorsk Iron and Steel Works

A person may appear with or without a form of address (Mr, Mrs, Mme, Dhr...).
An organisation may appear with or without a legal form (Ltd, NV, LLP, Inc, S/A, Spolka z ograniczona odpowiedzialnoscia, Spolka akcyjna...).

Hi! There's no single definitive answer for this: it all comes down to what you want the model to predict and what's most useful in your application. Based on that, you can come up with your annotation scheme and make sure your data is annotated consistently.

A model will typically have an easier time predicting spans with clear boundaries that can be inferred from the context. So if you can design your annotation guidelines in a way that makes it easier for the model to learn, you'll likely see better results.

If I remember correctly, the OntoNotes 5 corpus (which is what spaCy's English models are trained on) annotates person names without titles, and company names with the legal form included (but not "corporation" in "Microsoft corporation"). But you might also want to check out some annotation manuals for NER to see what the common guidelines are.
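If you want to see how those conventions play out in practice, a quick check is to run the pretrained model over a few of your examples and look at the boundaries it predicts. This is just a minimal sketch: isolated names with no surrounding context won't always be predicted reliably.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

# A few of the example names from above; there's no surrounding context,
# so predictions may be less reliable than on full sentences.
for text in ["Dr. Smith", "Mrs Elvira J. Wright", "Apple Inc.", "Microsoft corporation"]:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])
```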

Hi,

We started with ner.teach and the en_core_web_lg model on the names field, and it looks promising.
Our goal is to distinguish between individuals and organizations as well as possible.
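For reference, the command we're running looks roughly like this (the dataset name and the file path are placeholders for our own):

```
prodigy ner.teach names_person_org en_core_web_lg ./names.jsonl --label PERSON,ORG
```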

Okay, then you probably want to make sure that your annotations are consistent with what the model was trained on (e.g. the same conventions as the OntoNotes 5 data). Otherwise, you may end up with worse results because you're constantly "fighting" the existing weights.

What do you propose we do? Create a whole new model based on our own annotations?

If you're updating an existing model, that's totally fine and you can collect more annotations. I was just trying to answer your initial question: if you're deciding what to label and how to annotate PERSON and ORG, and you're updating an existing model, you should follow the same guidelines as the dataset the model was trained on.
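To make that concrete, here's a minimal sketch of updating the existing pipeline with your own annotations (spaCy v3 API; the training examples are made up and follow OntoNotes-style boundaries, i.e. titles excluded from PERSON, legal forms kept in ORG):

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_lg")

# Made-up examples following OntoNotes-style boundaries:
# titles like "Mrs" are excluded from PERSON, legal forms like "Inc." / "Ltd" stay in ORG.
TRAIN_DATA = [
    ("Mrs Elvira J. Wright", {"entities": [(4, 20, "PERSON")]}),
    ("Apple Inc.", {"entities": [(0, 10, "ORG")]}),
    ("Rolfson Ltd", {"entities": [(0, 11, "ORG")]}),
]

# Resume training from the existing weights instead of starting from scratch.
optimizer = nlp.resume_training()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, drop=0.35, losses=losses)
    print(epoch, losses)
```

In practice you'd more likely use Prodigy's training recipes or spacy train with a config rather than a hand-rolled loop; the point here is just that the annotation boundaries match the conventions the model already knows.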

Clear, many thanks!
