Annotating / training against inconsistent PERSON entities

I’d advise following the annotation standard used by spaCy’s pre-trained models, which doesn’t include the title as part of the name. So, in a phrase like “Dr. Smith”, the title “Dr.” would be outside the entity, and the entity would be just “Smith”.
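To make the convention concrete, here's a rough sketch of what one annotated example could look like. The dict layout (a "text" plus a list of "spans" with character offsets and a label) is modeled on the span format used in annotation tasks, and the sentence itself is made up for illustration.

```python
# Hypothetical example sentence; the title "Dr." stays outside the PERSON span.
text = "Dr. Smith will see you now."

# Compute the character offsets of the name only, excluding the title.
start = text.index("Smith")
end = start + len("Smith")

example = {
    "text": text,
    "spans": [{"start": start, "end": end, "label": "PERSON"}],
}

# The span covers "Smith", not "Dr. Smith".
print(text[start:end])
```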

The advantage of following this convention is that you’ll be able to use the pre-trained models. If you fight the annotation standard they were trained on, you’ll need a lot of training data, because the models will start out very confident that titles are outside the entities.

The corpora settled on this annotation standard because there’s a continuum from things which are very clearly titles (Mr., Dr., etc.), through things which are arguably titles (Captain, Professor), to things which aren’t titles but look quite like them (President, Judge, Senator). The policy of excluding the title makes the task easier, both for the annotator and for the model.

If your downstream application needs the titles to be part of the span, I would recommend having a rule-based post-process to adjust the boundaries. @ines provides some example code for that in this thread: finding patterns with ner.teach
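In case it helps, here's a minimal sketch of what such a post-process could look like. This is not the code from the linked thread, just an illustration in plain Python: the `TITLES` list and the `extend_person_spans` helper are hypothetical names, and it assumes entities come as `(start, end, label)` token offsets (with spaCy you could build these from `ent.start`, `ent.end` and `ent.label_`) and that the tokenizer keeps abbreviations like "Dr." as a single token.

```python
# Hypothetical title list; extend it to whatever your application treats as a title.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "prof.", "professor", "captain"}


def extend_person_spans(tokens, spans, titles=TITLES):
    """Move the start of each PERSON span left over directly preceding title tokens.

    tokens: list of token strings
    spans:  list of (start, end, label) tuples, token offsets, end exclusive
    """
    adjusted = []
    prev_end = 0  # never grow a span into the previous entity
    for start, end, label in sorted(spans):
        if label == "PERSON":
            while start > prev_end and tokens[start - 1].lower() in titles:
                start -= 1
        adjusted.append((start, end, label))
        prev_end = end
    return adjusted


tokens = ["Dr.", "Smith", "met", "Captain", "Jane", "Doe", "in", "Paris"]
spans = [(1, 2, "PERSON"), (4, 6, "PERSON"), (7, 8, "GPE")]

print(extend_person_spans(tokens, spans))
# [(0, 2, 'PERSON'), (3, 6, 'PERSON'), (7, 8, 'GPE')]
```

Doing this after the model has run means the model can keep the convention it was trained on, and the decision about what counts as a title lives in one list you can edit.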
