Annotating / training against inconsistent PERSON entities

This might be a subjective matter, but what’s the best approach to annotating and training on PERSON entities where the person is represented by their surname only, which might be prefixed in a variety of ways? For example:

  • Smith
  • Dr. Smith
  • Ms. Smith

I’d advise following the annotation standards that are in spaCy’s pre-trained models, which don’t have the title as part of the name. So, in a phrase like “Dr. Smith”, the title “Dr.” would be outside the entity, and the entity would be “Smith”.

The advantage of following this convention is that you’ll be able to use the pre-trained models. Fighting the annotation standard they were trained with on this point will mean you need a lot of training data, because the models start out very confident that titles are outside the entities.

The reason the corpora decided on this annotation standard is that there’s a continuum from things which are very clearly titles (Mr., Dr., etc) through things which are arguably titles (Captain, Professor), to things which aren’t titles but look quite like them (President, Judge, Senator). The policy of excluding the title makes the task easier, both for the annotator and the model.

If your downstream application needs the titles to be part of the span, I would recommend having a rule-based post-process to adjust the boundaries. @ines provides some example code for that in this thread: finding patterns with ner.teach
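In case it’s useful, here’s a minimal sketch of that kind of boundary adjustment (not the exact code from the linked thread). The TITLES set and function name are just placeholders; it assumes the standard spaCy Doc/Span API and a model that predicts titles outside the entity:

```python
import spacy
from spacy.tokens import Span

# Placeholder title list; extend it to match the prefixes in your data
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def expand_person_titles(doc):
    """Extend each PERSON entity to include an immediately preceding title token."""
    new_ents = []
    for ent in doc.ents:
        prev = doc[ent.start - 1] if ent.start > 0 else None
        if ent.label_ == "PERSON" and prev is not None and prev.text in TITLES:
            # Rebuild the span one token wider so the title is included
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label_))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp = spacy.load("en_core_web_sm")
doc = expand_person_titles(nlp("We spoke to Dr. Smith about the results."))
print([(ent.text, ent.label_) for ent in doc.ents])
# expected: [('Dr. Smith', 'PERSON')]
```

You could also register this as a custom pipeline component that runs after the NER component, so the expansion happens automatically whenever you process text.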


Thanks for the guidance! I have another thread around here where I was asking something similar about company names, which can also take a number of different forms (e.g. BizCorp, BizCorp LLC, etc.). It’s an interesting challenge posed by the particular data I’m working with: the same entities can appear in a variety of forms, and I need to figure out how to annotate and train against this without sacrificing accuracy or recall.

This is one of the reasons why we advocate for semi-automatic processes. Enforcing these policies consistently is a chore for human annotators, and when they mess up, the model treats the example as deeply significant and tries to find a solution that accommodates the surprising annotation. Current ML algorithms aren’t very good at ignoring outliers.

If you use a semi-automatic process (such as ner.teach or ner.make-gold) you can let the model inform the annotations, which makes it easier to keep the policy consistent. If the model is already annotating the LLC as part of the company name, you go with that. The model will be good at remembering all those details of the annotation scheme. We want the human to focus on significant, obvious errors, ideally ones a human would never make.
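For reference, the invocation looks roughly like this, assuming a Prodigy 1.x-style setup with a JSONL source and en_core_web_sm as the base model (swap in your own dataset name, model, and input file):

```
prodigy ner.teach person_names en_core_web_sm ./texts.jsonl --label PERSON
prodigy ner.make-gold person_names en_core_web_sm ./texts.jsonl --label PERSON
```

ner.teach asks for binary accept/reject decisions on the model’s suggestions, while ner.make-gold pre-highlights the model’s predictions for you to correct, so the model’s existing policy (e.g. whether the LLC is inside the span) stays visible and easy to keep consistent.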
