Yes, that makes sense!
The en_core_web_lg
model already has a pretty solid concept of what a PERSON
is, so it'll take a while to really move it towards your definition of a person. Depending on what you're trying to do, you might want to consider combining the statistical model with a rule-based approach: after all, the model is already pretty good at finding persons, and you can easily improve it even more using Prodigy. The only difference is that it doesn't include titles like "Dr.".
However, if you use spaCy, you'll always have a reference to the tokens surrounding the entity. So you could write a custom pipeline component that runs after the regular entity recognizer, checks if the previous token is a title and, if so, replaces the entity with a new span that includes the title.
Here's a minimal proof of concept:
import spacy
from spacy.tokens import Span
nlp = spacy.load('en_core_web_sm') # or any other model
doc = nlp(u"Dr Pulse, Dr Borer; Dr Cure, Dr Gore") # test regular entities
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Dr Cure', 'PERSON'), ('Gore', 'PERSON')]
def add_titles_to_entities(doc):
    # this component takes a doc, expands person entities to include a
    # preceding title and returns the doc
    titles = ['dr', 'mr', 'dr.', 'mr.']  # etc.
    new_ents = []  # collect the new updated entities for the doc here
    for ent in doc.ents:
        if ent.start == 0 or ent.label_ != 'PERSON':
            # there's no token before the entity or it's not a person
            new_ents.append(ent)  # keep it and exit early
            continue
        prev_token = doc[ent.start - 1]
        if prev_token.lower_ in titles:
            # previous token's text is in the list of titles, so create a
            # new entity span that includes the title
            ent_with_title = Span(doc, ent.start - 1, ent.end, label=ent.label)
            new_ents.append(ent_with_title)
        else:  # keep the regular entity
            new_ents.append(ent)
    doc.ents = new_ents  # overwrite the entities with the new list
    return doc
# add component to the pipeline – make sure to add it last or after='ner'
nlp.add_pipe(add_titles_to_entities, last=True)
doc = nlp(u"Dr Pulse, Dr Borer; Dr Cure, Dr Gore") # process text again
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Dr Cure', 'PERSON'), ('Dr Gore', 'PERSON')]
In this example, you can also see that there's still room for improvement: "Borer" isn't recognised as a person at all. But it's usually a lot easier to improve the existing PERSON
category and just teach the model to recognise more persons, than it is to change its entire definition and policy, learned from a corpus of over 2 million words.
This was likely a result of this pattern:
{"label": "PERSON", "pattern" :[{"LOWER": "dr"}]}