Adding a custom NER to a pipeline overrides an original NER

Andrey · September 24, 2018, 8:20pm

I want to add a new pipeline component (EntityMatcher) and am following the example presented here.

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

list_of_drugs = ['insulin', 'aspirin', 'humalog', 'lantus', 'tamsulosin', 'amlodipine']

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

Now I have two options. First, I start with the original pipeline:

nlp = spacy.load('en_core_web_lg')

doc = nlp(u'Apple is looking at buying U.K. aspirin and tamsulosin startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 52 62 MONEY

Then I want to add a new component:

entity_matcher = EntityMatcher(nlp, list_of_drugs, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)
['tagger', 'parser', 'ner', 'entity_matcher']

and then applying it to a similar text gives only the drugs:

doc = nlp(u'Apple is looking at buying U.K. startup for production of aspirin and tamsulosin for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

aspirin 58 65 DRUG
tamsulosin 70 80 DRUG

I’m sure I’m missing something here, but I couldn’t find it in the docs. Is there a simple tweak to combine the original NER and the custom one?

ines · September 24, 2018, 8:29pm

Yes, in your code, you’re doing doc.ents = spans, which essentially overrides all entities set by the previous component. One quick fix could be to add the entity matcher before the entity recognizer:

nlp.add_pipe(entity_matcher, before='ner')

spaCy’s entity recognizer should respect pre-defined entities, and it can even improve the predictions, since the entities you’ve set manually will define the constraints for the statistical entity recognizer.
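The other way to avoid the override is to keep the entities that are already there and only add the new ones. Below is a minimal sketch of that merging step. It works on plain (start, end, label) tuples rather than spaCy Span objects so that it runs without a model, and merge_entities is a hypothetical helper name, not part of spaCy. It also assumes that the new spans should win wherever they overlap an existing entity, since entity spans are not supposed to overlap.

```python
def merge_entities(existing, new):
    """Combine two lists of (start, end, label) token spans.

    Spans in `new` take priority: any span in `existing` that overlaps
    one of them is dropped. Spans within `new` are assumed not to
    overlap each other.
    """
    def overlaps(a, b):
        # Half-open ranges [start, end) overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    kept = [old for old in existing if not any(overlaps(old, n) for n in new)]
    return sorted(kept + list(new))


# Illustrative token offsets only, not real model output.
statistical = [(0, 1, "ORG"), (5, 6, "GPE"), (14, 17, "MONEY")]
matched = [(10, 11, "DRUG"), (12, 13, "DRUG")]
print(merge_entities(statistical, matched))
```

Inside the component, the same idea would mean building the combined list from the current doc.ents plus the matcher's spans, and assigning that to doc.ents instead of the matcher's spans alone.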

For a more advanced approach that combines and overwrites entities dynamically, you might want to check out the EntityRuler component I developed for the upcoming spaCy v2.1.0 (still experimental). In case you haven’t seen it yet, this thread also discusses similar approaches and strategies.

Andrey · September 24, 2018, 8:32pm

Hi Ines,

Many thanks for your very quick reply! I saw your comment on a similar topic and tried to implement it as you suggested. However, it takes a very long time to perform this task (and is still running!), and I don’t know why. Normally it takes a few seconds, but now it has been running for several minutes. I’m very puzzled.

UPD: I’m using Anaconda and normally try not to install via ‘pip’. I just checked, and Anaconda only has version 2.0.12, not yet the most recent one.

ines · September 24, 2018, 8:38pm

That's definitely strange! Did you add some print statements to see where it hangs or what takes so long?
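One way to narrow this down is to run the pipeline components one at a time and print before and after each step, so a hang shows which component it started in. The sketch below is only an illustration: time_components is a hypothetical helper, and the two fake components are stand-ins so it runs without a model. With spaCy, nlp.pipeline is a list of (name, component) tuples and nlp.make_doc(text) gives the tokenized but unprocessed Doc, so those two could be passed in instead.

```python
import time


def time_components(pipeline, doc):
    """Run each (name, component) pair on doc in order and time every step."""
    timings = []
    for name, component in pipeline:
        # Print before the call so a hang reveals which component was running.
        print("running {} ...".format(name), flush=True)
        start = time.perf_counter()
        doc = component(doc)
        timings.append((name, time.perf_counter() - start))
        print("  {} finished in {:.3f}s".format(name, timings[-1][1]), flush=True)
    return doc, timings


# Stand-in components (assumptions for illustration, not spaCy objects).
def fake_matcher(doc):
    return doc + ["matched"]


def fake_ner(doc):
    time.sleep(0.01)
    return doc + ["ents"]


doc, timings = time_components([("entity_matcher", fake_matcher), ("ner", fake_ner)], [])
```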

Ah, okay - spacy-nightly is only available on pip at the moment, sorry! (It's just a lot easier for us to publish quick updates on pip.) Maybe you can use a separate virtual environment and delete it afterwards, so it doesn't mess with any of your conda installations? That'd be the recommended workflow anyways – you should always keep the nightly version separate.

Andrey · September 24, 2018, 8:40pm

No, I added nothing extra apart from your suggestion, and it hangs.

Sure, I will create a new virtual env to test the new version. Many thanks for your help!

Andrey · September 24, 2018, 11:14pm

It’s really puzzling that your suggested tweak causes the system to hang. Any ideas on how to debug it? I’m very curious why adding the new EntityMatcher before ‘ner’ causes such behaviour.

Moreover, I installed the newest spaCy version, 2.1.0a1, in a separate virtual env, and it hangs again!

drugs.jsonl:

{"label":"DRUG","pattern":[{"lower":"insulin"}]}
{"label":"DRUG","pattern":[{"lower":"amlodipine"}]}
{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"tamsulosin"}]}
{"label":"DRUG","pattern":[{"lower":"lantus"}]}

and the code:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp).from_disk('drugs.jsonl')
nlp.add_pipe(ruler, before='ner')

doc = nlp(u'Apple is looking at buying U.K. aspirin startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

where the problematic bit is

doc = nlp(u'Apple is looking at buying U.K. aspirin startup for $1 billion')
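For reference, a pattern file shaped like the drugs.jsonl above could be generated from a term list rather than written by hand. This is a rough sketch: write_patterns is a hypothetical helper, and it assumes each term tokenizes into its whitespace-separated words, which may not hold for terms containing punctuation. The example writes to a temporary directory, not to the real drugs.jsonl.

```python
import json
import os
import tempfile


def write_patterns(terms, label, path):
    """Write one JSON pattern per line, matching each term case-insensitively."""
    with open(path, "w", encoding="utf8") as f:
        for term in terms:
            # One {"lower": ...} token pattern per whitespace-separated word (assumption).
            pattern = [{"lower": word.lower()} for word in term.split()]
            f.write(json.dumps({"label": label, "pattern": pattern}) + "\n")


terms = ["insulin", "amlodipine", "aspirin", "tamsulosin", "lantus"]
path = os.path.join(tempfile.mkdtemp(), "drugs.jsonl")
write_patterns(terms, "DRUG", path)

with open(path, encoding="utf8") as f:
    lines = [json.loads(line) for line in f]
print(lines[0])
```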
