Annotating / training against inconsistent PERSON entities

honnibal · July 10, 2018, 8:55pm

I’d advise following the annotation standard used by spaCy’s pre-trained models, which doesn’t include the title as part of the name. So, in a phrase like “Dr. Smith”, the title “Dr.” would be outside the entity, and the entity would be “Smith”.
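To make the convention concrete, here is a small sketch of what such an annotation could look like as character offsets. The dict layout (`text`, `spans`, `start`, `end`, `label`) is an assumption for illustration, so check it against the format your tooling actually expects.

```python
# Hypothetical annotation record: the title "Dr." stays outside the span.
text = "Dr. Smith will see you now."

# Compute the character offsets of the name only, not the title.
start = text.index("Smith")
end = start + len("Smith")

example = {
    "text": text,
    "spans": [{"start": start, "end": end, "label": "PERSON"}],
}

span = example["spans"][0]
print(text[span["start"]:span["end"]])  # the annotated entity text
```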

The advantage of following this convention is that you’ll be able to use the pre-trained models. If you try to fight the annotation standard they were trained on, you’ll need a lot of training data, because the models will start out very confident that titles are outside the entities.

The corpora settled on this annotation standard because there’s a continuum from things which are very clearly titles (Mr., Dr., etc.) through things which are arguably titles (Captain, Professor), to things which aren’t titles but look quite like them (President, Judge, Senator). Excluding the title in every case makes the task easier, both for the annotator and the model.

If your downstream application needs the titles to be part of the span, I would recommend a rule-based post-process that adjusts the boundaries. @ines provides some example code for that in this thread: “finding patterns with ner.teach”.
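As a rough idea of what such a post-process might look like (this is not the code from the linked thread), here is a minimal sketch in plain Python. It assumes the text is already tokenized and that entities are `(start, end, label)` tuples with token offsets, end-exclusive and sorted by start. The `TITLES` set and the `expand_person_spans` name are hypothetical, and the list of titles is something you would tune for your own data.

```python
# Assumed title list for illustration; extend it to suit your data.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "prof."}


def expand_person_spans(tokens, spans, titles=TITLES):
    """Extend each PERSON span one token to the left if that token is a title.

    tokens: list of token strings.
    spans: list of (start, end, label) token offsets, end-exclusive,
           assumed sorted by start and non-overlapping.
    """
    expanded = []
    for start, end, label in spans:
        if label == "PERSON" and start > 0 and tokens[start - 1].lower() in titles:
            # Only grow the span if the title token is not already
            # covered by the previous entity.
            prev_end = expanded[-1][1] if expanded else 0
            if start - 1 >= prev_end:
                start -= 1
        expanded.append((start, end, label))
    return expanded


tokens = ["Dr.", "Smith", "met", "Senator", "Jones", "at", "Acme", "."]
spans = [(1, 2, "PERSON"), (4, 5, "PERSON"), (6, 7, "ORG")]
result = expand_person_spans(tokens, spans)
print(result)
```

Here “Dr.” is pulled into the first span, while “Senator” is left out because it isn’t in the assumed title list. The model keeps predicting the standard it was trained on, and the boundary policy stays in one place where it is easy to change.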
