This might be a subjective matter, but what’s the best approach to annotation and training on PERSON entities where the person is represented by their surname only, which might be prefixed in a variety of ways? Example:
- Smith
- Dr. Smith
- Ms. Smith
I’d advise following the annotation standards that are in spaCy’s pre-trained models, which don’t have the title as part of the name. So, in a phrase like “Dr. Smith”, the title “Dr.” would be outside the entity, and the entity would be “Smith”.
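A quick way to see this convention in action is to run a pre-trained pipeline over a titled name. A minimal sketch, assuming en_core_web_sm is installed:

```python
import spacy

# Load a pre-trained English pipeline (assumes en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr. Smith examined the patient on Tuesday.")
for ent in doc.ents:
    # Typically prints "Smith PERSON" (and "Tuesday DATE"):
    # the title "Dr." stays outside the PERSON span.
    print(ent.text, ent.label_)
```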
The advantage of following this convention is that you’ll be able to use the pre-trained models. Trying to fight the annotation standard they’re already trained with will mean you need a lot of training data, as the models will start out very confident that titles are outside the entities.
The reason the corpora decided on this annotation standard is that there’s a continuum from things which are very clearly titles (Mr., Dr., etc.), through things which are arguably titles (Captain, Professor), to things which aren’t titles but look quite like them (President, Judge, Senator). The policy of excluding the title makes the task easier, both for the annotator and the model.
If your downstream application needs the titles to be part of the span, I would recommend having a rule-based post-process to adjust the boundaries. @ines provides some example code for that in this thread: finding patterns with ner.teach
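For reference, here’s a minimal sketch of that kind of boundary adjustment. The title list and function name are hypothetical choices of mine, not the code from the linked thread:

```python
import spacy
from spacy.tokens import Span

# Hypothetical set of titles to absorb into PERSON spans; extend for your data.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def expand_person_entities(doc):
    """Widen each PERSON span to include an immediately preceding title token."""
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start > 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text.lower().strip(".") in TITLES:
                # Rebuild the span one token wider, keeping the PERSON label.
                ent = Span(doc, ent.start - 1, ent.end, label=ent.label_)
        new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp = spacy.load("en_core_web_sm")
doc = expand_person_entities(nlp("Dr. Smith examined the patient."))
print([(ent.text, ent.label_) for ent in doc.ents])
```

In recent spaCy versions you could also register this as a custom pipeline component and add it after the "ner" component, so the expansion happens automatically on every document.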
Thanks for the guidance! I have another thread around here where I was asking something similar about company names, which can also take a number of different forms (e.g. BizCorp, BizCorp LLC, etc.). The data I’m working with poses an interesting challenge: the same entities can appear in a variety of forms, and I need to figure out how to annotate and train against this without sacrificing accuracy or recall.
This is one of the reasons why we advocate for semi-automatic processes. Enforcing these policies consistently is a chore for human annotators, and when they mess up, the model treats the example as deeply significant and tries to find a solution that accommodates the surprising annotations. Current ML algorithms aren’t very good at ignoring outliers.
If you use a semi-automatic process (such as ner.teach or ner.make-gold), you can let the model inform the annotations, which makes it easier to keep the policy consistent. If the model is already annotating the LLC as part of the company name, you go with that. The model will be good at remembering all those details of the annotation scheme. We want the human to focus on significant, obvious errors, ideally ones a human would never make.
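As a concrete starting point, a ner.teach session is launched from the command line, roughly like this (the dataset name, model, and source file here are placeholders):

```
prodigy ner.teach company_names en_core_web_sm ./my_data.jsonl --label ORG
```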