Annotating / training against PERSON and ORG entities

Robert · September 27, 2020, 3:49am

As a newbie, I would like some advice.

What’s the best approach to annotating and training on PERSON and ORG entities?
I have a large set of names of persons and organisations mixed together from different countries.
My set includes a name-address field, name field and address field (no surrounding context).

Which field is best to use: name-address or name only?
Must the form of address and legal form be included in the annotation?

Every recommendation is particularly appreciated.

Example:

A. Smith
Apple
Dr. Smith
Microsoft
Microsoft corporation
Mr. and Mrs. Smith
Apple Inc.
Bins and Sons
Michael E. Andrade
Rolfson Ltd
Mrs Elvira J. Wright
Mme Kari Ratte
Muse Kharlamova
Alexander Larionova
Fabryka Samochodow Malolitrazowych
Mai Thị Xuan Trang
Magnitogorsk Iron and Steel Works

The person is represented with or without a form of address (Mr, Mrs, Mme, Dhr...).
The organisation is represented with or without a legal form (Ltd, NV, LLP, Inc, S/A, Spolka z ograniczona odpowiedzialnoscia, Spolka akcyjna...).

ines · September 28, 2020, 6:42pm

Hi! There's no "true" definitive answer for this – it all comes down to what's consistent and what you want the model to predict and what's most useful in your application. Based on that, you can come up with your annotation scheme and make sure your data is annotated consistently.

A model will typically have an easier time predicting spans with clear boundaries that can be inferred from the context. So if you can design your annotation guidelines in a way that makes it easier for the model to learn, you'll likely see better results.

If I remember correctly, the OntoNotes 5 corpus (which is what spaCy's English models are trained on) annotates person names without titles and company names with the legal form (but not "corporation" in "Microsoft corporation"). But you might also want to check out some annotation manuals for NER to see what the common guidelines are here.
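
To illustrate the "person names without titles" convention, here is a minimal sketch of how you could compute a PERSON span that skips a leading form of address in a name-only field. The `TITLES` list and the `person_span` helper are hypothetical and not part of spaCy or Prodigy, and the list is deliberately incomplete.

```python
import re

# Hypothetical, incomplete list of forms of address; extend it for your data.
TITLES = ("Mr", "Mrs", "Ms", "Dr", "Mme", "Dhr")

# Matches one leading title, an optional period, and the whitespace after it.
TITLE_RE = re.compile(r"^(?:%s)\.?\s+" % "|".join(TITLES), re.IGNORECASE)


def person_span(text):
    """Return (start, end, label) for a name-only field, skipping a leading title."""
    match = TITLE_RE.match(text)
    start = match.end() if match else 0
    return (start, len(text), "PERSON")


for name in ["Dr. Smith", "Mrs Elvira J. Wright", "Michael E. Andrade"]:
    start, end, label = person_span(name)
    print(repr(name[start:end]), label)
```

This only handles a single leading title, so something like "Mr. and Mrs. Smith" would still need a decision in your annotation guidelines.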

Robert · September 29, 2020, 3:39am

Hi,

We started with ner.teach and model en_core_web_lg on the names field and it looks promising.
Our goal is to distinguish between individuals and organizations as well as possible.

ines · September 29, 2020, 8:38am

Okay, then you probably want to make sure that your annotations are consistent with what the model was trained on (e.g. the same conventions as the OntoNotes 5 data). Otherwise, you may end up with worse results because you're constantly "fighting" the existing weights.

Robert · September 29, 2020, 9:40am

What do you propose we do? Create a whole new model based on our own annotations?

ines · September 29, 2020, 12:52pm

If you're updating an existing model, that's totally fine and you can collect more annotations. I was just trying to answer your initial question: if you're trying to decide what to label and how to annotate PERSON and ORG, and you're updating an existing model, you should follow the same guidelines as the dataset the model was trained on.
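
One way to check that the collected annotations follow the same convention is to scan them for PERSON spans that still include a title. The sketch below assumes each record is a dict with a `"text"` string and a `"spans"` list of `start`/`end`/`label` entries; the `TITLES` set, the `find_title_spans` helper and the sample records are made up for illustration.

```python
# Hypothetical, incomplete set of forms of address; extend it for your data.
TITLES = {"mr", "mrs", "ms", "dr", "mme", "dhr"}


def find_title_spans(examples):
    """Yield (text, span_text) for PERSON spans whose first token is a title."""
    for eg in examples:
        for span in eg.get("spans", []):
            if span["label"] != "PERSON":
                continue
            span_text = eg["text"][span["start"]:span["end"]]
            tokens = span_text.split()
            if tokens and tokens[0].rstrip(".").lower() in TITLES:
                yield eg["text"], span_text


# Made-up sample annotations, assumed to have "text" and "spans" keys.
examples = [
    {"text": "Dr. Smith", "spans": [{"start": 0, "end": 9, "label": "PERSON"}]},
    {"text": "Dr. Smith", "spans": [{"start": 4, "end": 9, "label": "PERSON"}]},
    {"text": "Apple Inc.", "spans": [{"start": 0, "end": 10, "label": "ORG"}]},
]

flagged = list(find_title_spans(examples))
for text, span_text in flagged:
    print(f"PERSON span includes a title: {span_text!r} in {text!r}")
```

Anything it flags is worth reviewing by hand before training, since mixed conventions make the span boundaries harder for the model to learn.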

Robert · September 30, 2020, 5:39am

Clear, many thanks!
