Extending existing entity type with patterns

Hi,
I’m trying to get the system to recognize names such as "Dr xxx" or "Mr. xxxx", so I made some manual patterns like this:

{"label": "PERSON", "pattern" :[{"LOWER": "dr"}, {"IS_PUNCT": true, "OP": "?" }, {}]}
{"label": "PERSON", "pattern" :[{"LOWER": "mr"}, {"IS_PUNCT": true, "OP": "?" }, {}]}

and used a command like so:

prodigy ner.teach dr_mr_names en_core_web_lg "dr_names.txt" --label PERSON --patterns person_patterns.jsonl --loader txt

where the dr_names.txt file simply contains the text from this site: http://www.u.arizona.edu/~stoddard/doctor.htm, so the file looks like this:

Dr Klotz (Clots); Dr Wax|Ear, Nose and Throat
Dr Pulse, Dr Borer; Dr Cure, Dr Gore   Emergency Medicine

When I run the recipe, the system suggests names all right, but the ‘dr’ is never included in the suggested span.
What am I doing wrong?

Thanks.

When you look at the metadata displayed in the bottom right corner, does it include a "Pattern" entry? If not, those examples are actually the model's predictions. If you use ner.teach with patterns, Prodigy will mix pattern matches with the model's suggestions. Since the model already knows a lot about the label PERSON, it starts suggesting things right away, and the default NER annotation scheme for the PERSON label always excludes the title.

If you reject those examples, do you eventually see the pattern matches? Or do they never show up?

I just entered your pattern in our interactive Matcher demo and it looks alright (ignore the IS_ALPHA on the last token – I had to add this because the demo doesn't yet support the empty dict for "any token"). In fact, I think the {"IS_PUNCT": true, "OP": "?"} token is even redundant, since spaCy's tokenizer treats "Dr." as an exception and won't actually split it. So instead, you might want to add a second pattern with {"LOWER": "dr."}.
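You can also verify the tokenizer exception yourself with a quick check like this (any English model should work, en_core_web_sm is just one choice):

import spacy

nlp = spacy.load('en_core_web_sm')  # any English model will do for this check
doc = nlp(u"Dr. Wax and Dr Gore")
print([token.text for token in doc])
# ['Dr.', 'Wax', 'and', 'Dr', 'Gore'] – "Dr." is kept as one token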

I changed the patterns a little, first like:

{"label": "PERSON", "pattern" :[{"LOWER": "dr"}, {}]}
{"label": "PERSON", "pattern" :[{"LOWER": "dr."}, {}]}

next like:

{"label": "PERSON", "pattern" :[{"LOWER": "dr"}, {}]}
{"label": "PERSON", "pattern" :[{"LOWER": "dr."}, {}]}
{"label": "PERSON", "pattern" :[{"LOWER": "dr"}]}

I ran through the entire set until it said ‘No tasks available’ (it wasn’t that big). With the first set of patterns, it never mentioned anything about ‘pattern’ in the lower right corner, but during the second run with the alternative set of patterns, it did start to say ‘pattern’ after a while. Sometimes correctly, sometimes only for the word ‘dr’.
There were also a number of correct suggestions before it started saying ‘pattern’ (even with the first set of patterns). During the second run, there were more correct suggestions. Am I correct in presuming that it is slowly starting to learn how I would like it to pick persons?

Yes, that makes sense!

The en_core_web_lg model already has a pretty solid concept of what a PERSON is, so it'll take a while to really move it towards your definition of a person. Depending on what you're trying to do, you might want to consider a combination of the statistical model and a rule-based approach: after all, the model is already pretty good at finding persons, and you can easily improve it even more using Prodigy. The only difference is that it doesn't include titles like "Dr.".

However, if you use spaCy, you'll always have a reference to the tokens surrounding the entity. So you could write a custom pipeline component that runs after the regular entity recognizer, checks if the previous token is a title and, if so, replaces the entity with a new span that includes the title.

Here's a minimal proof of concept:

import spacy
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')  # or any other model
doc = nlp(u"Dr Pulse, Dr Borer; Dr Cure, Dr Gore")  # test regular entities
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Dr Cure', 'PERSON'), ('Gore', 'PERSON')]

def add_titles_to_entities(doc):
    # this component takes a doc, expands PERSON entities to include a preceding title and returns it
    titles = ['dr', 'mr', 'dr.', 'mr.']  # etc.
    new_ents = []  # collect the new updated entities for the doc here
    for ent in doc.ents:
        if ent.start == 0 or ent.label_ != 'PERSON':
            # there's no token before the entity or it's not a person
            new_ents.append(ent)  # keep it as-is and skip to the next entity
            continue
        prev_token = doc[ent.start - 1]
        if prev_token.lower_ in titles:  
            # previous token text is in list of titles, so create new entity with title
            ent_with_title = Span(doc, ent.start - 1, ent.end, label=ent.label)
            new_ents.append(ent_with_title)
        else:  # keep the regular entity
            new_ents.append(ent)
    doc.ents = new_ents  # overwrite the entities with the new list
    return doc

# add component to the pipeline – make sure to add it last or after='ner'
nlp.add_pipe(add_titles_to_entities, last=True)
doc = nlp(u"Dr Pulse, Dr Borer; Dr Cure, Dr Gore")  # process text again
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Dr Cure', 'PERSON'), ('Dr Gore', 'PERSON')]

In this example, you can also see that there's still room for improvement: "Borer" isn't recognised as a person at all. But it's usually a lot easier to improve the existing PERSON category and just teach the model to recognise more persons than it is to change its entire definition and policy, learned from a corpus of over 2 million words.

The matches for just the word ‘dr’ were likely a result of this pattern:

{"label": "PERSON", "pattern" :[{"LOWER": "dr"}]}

OK, many thanks. I’ll keep these tips in mind (I’m prepping to write an anonymization tool).