Is it possible for the entities tagged and merged in one document to be respected when passed to another spacy.load() model?

For important reasons, we have two entirely separate models trained and made. I want to add to the entities already tagged by model A by passing the spaCy document to model B.

But model B seems to destroy any work done by model A. How can I supplement and respect/never overwrite the entities tagged by model A? My intuition is that this can be done by adding model B as a custom pipeline component to model A, but even there wouldn't entity attributes be overwritten?

I'm currently working around this "manually" with the code below, but is there a spaCy way to preserve entities?

from spacy.pipeline import merge_entities

sent = 'Patient took insulin glargine as needed and she took tylenol for two weeks.'
a = drug(sent)  
b = med(sent)

indexes = [(token.ent_type_, token.text) if token.ent_type_ == 'DRUG' else None for token in a]
indexes2 = [(token.ent_type_, token.text) if token.ent_type_ in attributes else None for token in b]
>>> 13 [None, None, ('DRUG', 'insulin glargine'), None, None, None, None, None, ('DRUG', 'tylenol'), None, None, None, None]
>>> 10 [None, None, None, ('FREQUENCY', 'as needed'), None, None, None, None, ('DURATION', 'for two weeks'), None]

# The goal: 'Patient took insulin glargine as needed and she took tylenol for two weeks.'
>>> 10 [None, None, ('DRUG', 'insulin glargine'), ('FREQUENCY', 'as needed'), None, None, None, ('DRUG','tylenol'), ('DURATION', 'for two weeks'), None]

This assumes you have two "ner" components (EntityRecognizer, with trained models) rather than EntityRuler or some other custom component that adds entities, since they can behave slightly differently when combined. (I'm not sure what your setup is when you see model B destroying annotations from model A. That shouldn't happen with "ner" components, and it should only happen with the EntityRuler if you've set overwrite_ents=True.)

You should be able to add the NER component from one model into another as long as they are loaded with the same vocab. (All the models and components and docs need to share the same vocab or things will go wrong with the StringStore.)

As long as both models have the same language and the exact same vectors (or no vectors), I think you should be able to do something like this:

import spacy

nlp1 = spacy.load("model1")
nlp2 = spacy.load("model2", vocab=nlp1.vocab)

# check to be sure this worked as expected
assert nlp1.vocab == nlp2.vocab

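# give the second ner component a custom name because there's already an "ner" in the pipeline
# (this is the spaCy v2 call; in spaCy v3 the equivalent is nlp1.add_pipe("ner", name="model2_ner", source=nlp2))
nlp1.add_pipe(nlp2.get_pipe("ner"), name="model2_ner")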

doc = nlp1(text)

The first NER component will have priority and the second component should only add new entities where there are no entities marked by the first, basically just where there's O if you look at the word-level IOB tags.
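To make the priority rule concrete, here is a minimal sketch in plain Python rather than spaCy API. It assumes entities are represented as (start, end, label) token-offset tuples with an exclusive end, and the function name combine_with_priority is hypothetical. Unlike the real "ner" component, which can leave partial spans, this sketch simply drops any span from the second model that touches a token already claimed by the first.

```python
def combine_with_priority(primary, secondary):
    """Keep every primary span; add secondary spans that touch no primary token."""
    taken = set()
    for start, end, _label in primary:
        taken.update(range(start, end))
    combined = list(primary)
    for start, end, label in secondary:
        # Only accept a secondary span if none of its tokens are already claimed.
        if taken.isdisjoint(range(start, end)):
            combined.append((start, end, label))
    return sorted(combined)


# Token offsets for the unmerged example sentence:
# Patient took insulin glargine as needed and she took tylenol for two weeks .
drug_spans = [(2, 4, "DRUG"), (9, 10, "DRUG")]
med_spans = [(4, 6, "FREQUENCY"), (10, 13, "DURATION")]
combined_spans = combine_with_priority(drug_spans, med_spans)
print(combined_spans)
```

The same idea could be applied to the output of two separately run pipelines, as long as both tokenize the text identically so that the offsets line up.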

The NER model predicts word by word, which means that you might end up with partial entities when the second model starts predicting a new entity but runs into an existing entity before it reaches the point where it would otherwise have ended the entity. If the predicted spans for the two models don't overlap much, this shouldn't be much of a problem, but if an entity type in the first model is often nested in an entity type in the second, be aware that you may see odd partial spans.

The predictions of the second model will also be influenced by the annotations from the first model as it predicts around the existing spans, so you may not see the exact same results that you would on a blank document.

You only want to merge the entities after both components have run. Each separate model is trained on unmerged text, so if you merge the tokens from the first model before annotating it with the second model, the second model won't work well because it hasn't seen input sequences like this before.
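The order of operations (annotate with both models first, merge last) can be illustrated with a small plain-Python sketch. It does not use spaCy's retokenizer or merge_entities; the helper name merge_labelled_tokens and the (start, end, label) span tuples are assumptions made for illustration, and the spans are assumed to be non-empty and non-overlapping.

```python
def merge_labelled_tokens(tokens, spans):
    """Collapse each (start, end, label) span into one entry; other tokens become None."""
    starts = {start: (end, label) for start, end, label in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in starts:
            end, label = starts[i]
            merged.append((label, " ".join(tokens[i:end])))
            i = end  # skip past the tokens that were just merged
        else:
            merged.append(None)
            i += 1
    return merged


tokens = ["Patient", "took", "insulin", "glargine", "as", "needed", "and",
          "she", "took", "tylenol", "for", "two", "weeks", "."]
# Entities from both models, already combined, as exclusive-end token offsets.
spans = [(2, 4, "DRUG"), (4, 6, "FREQUENCY"), (9, 10, "DRUG"), (10, 13, "DURATION")]
merged = merge_labelled_tokens(tokens, spans)
print(len(merged), merged)
```

Because merging happens only after both sets of spans exist, neither model ever has to predict over already-merged tokens.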

This is excellent! Wow, your solution might work better than anything I had imagined.

Just checked, and it looks like the two vocabs are different between model A and B. Model B was trained by someone else entirely, and I don't have its training data; model A was trained by me. How could I make the two vocabs match so that I could pull the 'ner' component from the second model and add it as a pipeline component to the first?

Edit to add: So far, merely using model A's vocab seems to pass the sniff test. I noticed that len() of the two models' vocabs is 503 and 506, and for 'en_core_web_md' it is 496. I thought lexemes were akin to words.

Edit again to add: Is it possible for a custom NER pipeline to only run on those sentences where a specific NER label is present? This would cut down immensely on compute time. But right now, the pipeline runs on the entire document passed to it.

Thank you!

Each pipeline component should handle adding its own labels as strings to the StringStore (e.g., when you load the second model it will add FREQUENCY if it's not already there).

You can think of the vocab as more of a cache of tokens you've seen before. It will grow as you process new texts with your pipeline, so you don't need to be concerned that it starts out small or that the two models have slightly different sizes on init.

You can extract sentences with entities and convert them to standalone docs pretty easily. You still need to make sure the vocabs are the same, as before:

# nlp1 is the first pipeline from the earlier snippet; doc is a document it has already processed
nlp2 = spacy.load("model2", vocab=nlp1.vocab)

for ent in doc.ents:
    # convert the sentence containing this entity to a doc
    sent_doc = ent.sent.as_doc()
    # apply just the NER component from another pipeline to this doc
    sent_doc = nlp2.get_pipe("ner")(sent_doc)
    # ...

If you want to add the annotations from sent_doc to the original doc, you'll have to adjust the offsets and add them back in by hand.
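The offset adjustment itself is simple arithmetic. Here is a minimal sketch in plain Python, assuming the sentence-level entities have been read off as (start, end, label) token offsets (for example from each span's start, end and label_ in sent_doc.ents) and that the sentence's first token index in the original doc is known (for example from ent.sent.start). The helper name to_doc_offsets and the numbers below are made up for illustration.

```python
def to_doc_offsets(sent_start, sent_spans):
    """Shift sentence-relative (start, end, label) spans by the sentence's start token index."""
    return [(start + sent_start, end + sent_start, label)
            for start, end, label in sent_spans]


# Suppose the sentence begins at token 20 of the original doc and the second
# model found a span covering tokens 4-5 of the standalone sentence doc.
doc_level = to_doc_offsets(20, [(4, 6, "FREQUENCY")])
print(doc_level)
```

If several entities fall in the same sentence, it is worth processing each sentence only once, otherwise the same spans will be produced repeatedly.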