Adding a custom NER to a pipeline overrides the original NER

I want to add a new pipeline component (EntityMatcher) and am following an example presented here.

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

list_of_drugs = ['insulin', 'aspirin', 'humalog', 'lantus', 'tamsulosin', 'amlodipine']

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

Now I have two options. First, I start with the original pipeline:

nlp = spacy.load('en_core_web_lg')

doc = nlp(u'Apple is looking at buying U.K. aspirin and tamsulosin startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 52 62 MONEY

Then I want to add a new component:

entity_matcher = EntityMatcher(nlp, list_of_drugs, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)
['tagger', 'parser', 'ner', 'entity_matcher']

and then applying it to a similar text gives only the drugs:

doc = nlp(u'Apple is looking at buying U.K. startup for production of aspirin and tamsulosin for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

aspirin 58 65 DRUG
tamsulosin 70 80 DRUG

I’m sure I’m missing something here, but I couldn’t find it in the docs. Is there a simple tweak to combine both the original NER and the custom one?

Yes, in your code, you’re doing doc.ents = spans, which essentially overrides all entities set by the previous component. One quick fix could be to add the entity matcher before the entity recognizer:

nlp.add_pipe(entity_matcher, before='ner')

spaCy’s entity recognizer should respect pre-defined entities, and it can even improve the predictions, since the entities you’ve set manually will define the constraints for the statistical entity recognizer.
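If the custom component has to stay after 'ner', another option is to extend the existing entities instead of replacing them. The sketch below shows only the merge logic, using plain (start, end, label) tuples in place of spaCy Span objects. The helper name merge_entity_spans and the token offsets are made up for illustration. It drops any new span that overlaps an existing entity, on the assumption that doc.ents does not accept overlapping spans.

```python
# Hypothetical helper, not part of spaCy: spans are plain (start, end, label)
# token-offset tuples so the merge logic can run without loading a model.

def merge_entity_spans(existing, new):
    """Keep every existing span and add each new span that overlaps none of the kept ones."""
    merged = list(existing)
    for start, end, label in new:
        overlaps = any(
            start < kept_end and kept_start < end
            for kept_start, kept_end, _ in merged
        )
        if not overlaps:
            merged.append((start, end, label))
    # Sort by start offset so the result reads in document order.
    return sorted(merged)


# Assumed token offsets for the second example sentence in the question.
model_ents = [(0, 1, "ORG"), (5, 6, "GPE"), (14, 17, "MONEY")]
drug_matches = [(10, 11, "DRUG"), (12, 13, "DRUG")]
combined = merge_entity_spans(model_ents, drug_matches)
for span in combined:
    print(span)
```

Inside `__call__`, the same idea would amount to starting from `list(doc.ents)` and appending only the non-overlapping matched spans before assigning the result back to `doc.ents`.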

For a more advanced approach that combines and overwrites entities dynamically, you might want to check out the EntityRuler component I developed for the upcoming spaCy v2.1.0 (still experimental). In case you haven’t seen it yet, this thread also discusses similar approaches and strategies.

Hi Ines,

Many thanks for your very quick reply! I saw your comment on a similar topic and tried to implement it as you suggested. However, it takes a very long time (and is still running!) to perform this task and I don’t know why. Normally it takes a few seconds, but now it runs for several minutes. I’m very puzzled.

UPD: I’m using Anaconda and normally try not to install via ‘pip’. I just checked, and Anaconda only has version 2.0.12, not yet the most recent one.

That's definitely strange! Did you add some print statements to see where it hangs or what takes so long?
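One way to narrow this down is to run the pipeline components one at a time and print before and after each call, so the last name printed is the component that never returns. The sketch below is only an illustration: time_components is a made-up helper, and the stand-in components and the plain string used as a "doc" are there so it runs without a model. With spaCy v2, the real inputs would presumably be `nlp.make_doc(text)` and `nlp.pipeline`, which is a list of (name, component) pairs.

```python
import time


def time_components(doc, pipeline):
    """Run each (name, component) pair in order and record how long it took."""
    timings = []
    for name, component in pipeline:
        # Print before the call: if a component hangs, its name is the last line shown.
        print("running {} ...".format(name), flush=True)
        started = time.perf_counter()
        doc = component(doc)
        elapsed = time.perf_counter() - started
        print("{} finished in {:.3f}s".format(name, elapsed), flush=True)
        timings.append((name, elapsed))
    return doc, timings


# Stand-in components (assumptions for the demo, not spaCy code).
def fast_component(doc):
    return doc


def slow_component(doc):
    time.sleep(0.05)
    return doc


doc, timings = time_components(
    "some text", [("fast", fast_component), ("slow", slow_component)]
)
```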

Ah, okay - spacy-nightly is only available on pip at the moment, sorry! (It's just a lot easier for us to publish quick updates on pip.) Maybe you can use a separate virtual environment and delete it afterwards, so it doesn't mess with any of your conda installations? That'd be the recommended workflow anyways – you should always keep the nightly version separate.

No, nothing extra apart from your suggestion, and it hangs :confused:

Sure, I will create a new virtual env to test the new version. Many thanks for your help!

It’s really puzzling that your suggested tweak causes the system to hang. Any ideas on how to debug it? I’m very curious why adding the new EntityMatcher before ‘ner’ causes such behaviour.

Moreover, I installed the newest spaCy version, 2.1.0a1, in a separate virtual env, and it hangs again!

drugs.jsonl:

{"label":"DRUG","pattern":[{"lower":"insulin"}]}
{"label":"DRUG","pattern":[{"lower":"amlodipine"}]}
{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"tamsulosin"}]}
{"label":"DRUG","pattern":[{"lower":"lantus"}]}

and the code:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp).from_disk('drugs.jsonl')
nlp.add_pipe(ruler, before='ner')

doc = nlp(u'Apple is looking at buying U.K. aspirin startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

where the problematic bit is

doc = nlp(u'Apple is looking at buying U.K. aspirin startup for $1 billion')