What's the best approach to annotating and training on PERSON and ORG entities?
I have a large set of names of persons and organisations mixed together from different countries.
My set includes a name-address field, name field and address field (no surrounding context).
Which one is better to use: the combined name-address field or the name field only?
Should the form of address and the legal form be included in the annotated span?
Any recommendations are much appreciated.
Example:
A. Smith
Apple
Dr. Smith
Microsoft
Microsoft corporation
Mr. and Mrs. Smith
Apple Inc.
Bins and Sons
Michael E. Andrade
Rolfson Ltd
Mrs Elvira J. Wright
Mme Kari Ratte
Muse Kharlamova
Alexander Larionova
Fabryka Samochodow Malolitrazowych
Mai Thị Xuan Trang
Magnitogorsk Iron and Steel Works
A person can appear with or without a form of address (Mr, Mrs, Mme, Dhr...).
An organisation can appear with or without a legal form (Ltd, NV, LLP, Inc, S/A, Spolka z ograniczona odpowiedzialnoscia, Spolka akcyjna...).
Hi! There's no single definitive answer for this. It comes down to what you want the model to predict, what's most useful in your application, and what you can annotate consistently. Based on that, you can come up with your annotation scheme and make sure your data follows it throughout.
A model will typically have an easier time predicting spans with clear boundaries that can be inferred from the context. So if you can design your annotation guidelines in a way that makes it easier for the model to learn, you'll likely see better results.
If I remember correctly, the OntoNotes 5 corpus (which is what spaCy's English models are trained on) annotates person names without titles and company names with the legal form (but not "corporation" in "Microsoft corporation"). You might also want to check out some annotation manuals for NER to see what the common guidelines are here.
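To make that convention concrete, here is a minimal sketch of how name-only strings could be turned into span annotations that leave the form of address out of PERSON and keep the whole string for ORG. It assumes you already know the label for each string, that the title list is a small hypothetical sample you would extend for your own data, and that the output uses a `text`/`spans` dictionary with character offsets. The helper name `make_example` is made up for illustration.

```python
import re

# Hypothetical, incomplete list of forms of address; extend it for your data.
# Longer alternatives come first, and \b stops "dr" from matching "Drake".
TITLE_RE = re.compile(
    r"^(?:(?:mrs|mr|ms|dr|mme|dhr)\b\.?\s+(?:and\s+)?)+",
    re.IGNORECASE,
)

def make_example(text, label):
    """Build one annotation with a single span over the name.

    PERSON spans skip leading forms of address; ORG spans cover the
    whole string, legal form included. Generic descriptors such as
    "corporation" are not handled here (a simplification).
    """
    start = 0
    if label == "PERSON":
        match = TITLE_RE.match(text)
        if match:
            start = match.end()
    return {
        "text": text,
        "spans": [{"start": start, "end": len(text), "label": label}],
    }

for text, label in [
    ("Dr. Smith", "PERSON"),
    ("Mr. and Mrs. Smith", "PERSON"),
    ("A. Smith", "PERSON"),
    ("Rolfson Ltd", "ORG"),
]:
    example = make_example(text, label)
    span = example["spans"][0]
    print(label, repr(text[span["start"]:span["end"]]))
```

Whether "Mr. and Mrs. Smith" should really become a single PERSON span is itself a guideline decision, so treat the rule above as a starting point rather than a fixed answer.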
We started with ner.teach and the en_core_web_lg model on the name field, and it looks promising.
Our goal is to distinguish between individuals and organizations as well as possible.
Okay, then you probably want to make sure that your annotations are consistent with what the model was trained on (e.g. the same conventions as the OntoNotes 5 data). Otherwise, you may end up with worse results because you're constantly "fighting" the existing weights.
If you're updating an existing model, that's totally fine and you can collect more annotations. I was just trying to answer your initial question: if you're trying to decide what to label and how to annotate PERSON and ORG, and you're updating an existing model, you should follow the same guidelines as the dataset the model was trained on.
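One rough way to spot places where your annotations disagree with the existing model's conventions is to compare span boundaries. The sketch below is only an illustration: `boundary_conflicts` is a hypothetical helper, spans are plain `(start, end, label)` tuples, and the predicted span is hard-coded here. In practice you would build it from the model's output, for example from `doc.ents` using `start_char`, `end_char` and `label_`.

```python
def boundary_conflicts(gold, predicted):
    """Return (gold_span, predicted_span) pairs that share a label and
    overlap, but whose boundaries differ.

    Spans are (start, end, label) tuples with character offsets.
    """
    conflicts = []
    for g_start, g_end, g_label in gold:
        for p_start, p_end, p_label in predicted:
            overlaps = g_start < p_end and p_start < g_end
            identical = (g_start, g_end) == (p_start, p_end)
            if g_label == p_label and overlaps and not identical:
                conflicts.append(
                    ((g_start, g_end, g_label), (p_start, p_end, p_label))
                )
    return conflicts

text = "Dr. Smith"
gold = [(0, 9, "PERSON")]       # annotated with the title included
predicted = [(4, 9, "PERSON")]  # hypothetical model output without the title

for gold_span, pred_span in boundary_conflicts(gold, predicted):
    print("annotated:", repr(text[gold_span[0]:gold_span[1]]),
          "| predicted:", repr(text[pred_span[0]:pred_span[1]]))
```

If many of your examples show up here with the same pattern (say, titles included in PERSON), that is a hint that your guidelines and the model's training data follow different conventions.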