Error assigning label ID when combining custom NER model from Prodigy with spaCy dependency parsing model

I have a custom NER model trained with “en_vectors_web_lg” as the base model, and I wish to combine it with one of the core English spaCy models to handle dependency parsing, so that I can examine how my entities are related to each other. I can do this by loading both my custom model and a spaCy model, then applying them separately; however, I cannot successfully combine them.

My custom model contains two pipeline components:

[('sbd', <spacy.pipeline.SentenceSegmenter at 0x7fc11fff7e48>),
('ner', <spacy.pipeline.EntityRecognizer at 0x7fc1b81ccca8>)]

while the spaCy model contains three:

[('tagger', <spacy.pipeline.Tagger at 0x7fc1e5f5b5f8>),
('parser', <spacy.pipeline.DependencyParser at 0x7fc1e5ef80a0>),
('ner', <spacy.pipeline.EntityRecognizer at 0x7fc1e5f0d6d0>)]

I have tried using replace_pipe, loading the spaCy model without the NER pipe and using add_pipe to add my custom NER component to the spaCy model, and also adding the tagger and parser from the spaCy model to my custom model. Any time I try to use the resulting model to run the new or updated pipeline, I receive an error similar to:

ValueError: [E084] Error assigning label ID 8397771882303007253 to span: not in StringStore.

Based on some GitHub and Stack Overflow posts I have tried updating/replacing the vocab, without any luck, and am still looking for a solution.

It looks to me like you just need to get your labels into the StringStore. I think where things are going wrong is that if you load two models, nlp1 and nlp2, the pipeline components in the two models will have different StringStore and Vocab instances. There's probably somewhere in spaCy that assumes a component's string store is the same as the Doc object's, as this is normally the case.

You could do something like:


# `labels` here would be the label names your custom NER component
# predicts, e.g. the output of custom_nlp.get_pipe("ner").labels
for label in labels:
    nlp.vocab.strings.add(label)

But actually I think you might find the easiest solution is to just merge the directories. You can copy the model files for the pipeline components into one model directory, and then just edit the meta.json. This should give you an easy and reliable way to combine your components.
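A minimal sketch of that directory merge, using only the standard library. The component names, paths, and spaCy v2 on-disk layout (one sub-directory per pipeline component, plus a meta.json listing the pipeline order) are assumptions here, and you may want a different component order for your own pipeline:

```python
import json
import shutil
from pathlib import Path

def merge_model_dirs(custom_dir, core_dir, out_dir, take_from_custom=("sbd", "ner")):
    """Build a combined model directory by copying pipeline component
    folders from two saved spaCy models and merging their meta.json.
    Sketch only -- assumes spaCy v2's model directory layout."""
    custom_dir, core_dir, out_dir = Path(custom_dir), Path(core_dir), Path(out_dir)
    # Start from the core model (tagger, parser, vocab, vectors, ...)
    shutil.copytree(str(core_dir), str(out_dir))
    # Drop the core model's NER and copy in the custom components
    shutil.rmtree(str(out_dir / "ner"), ignore_errors=True)
    for name in take_from_custom:
        shutil.copytree(str(custom_dir / name), str(out_dir / name))
    # Keep the core metadata but fix the pipeline order
    meta = json.loads((out_dir / "meta.json").read_text())
    meta["pipeline"] = [p for p in meta["pipeline"] if p != "ner"] + list(take_from_custom)
    (out_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return meta["pipeline"]
```

After merging, you'd load the combined directory with spacy.load() as usual; since everything then shares one Vocab and StringStore, the E084 mismatch shouldn't arise.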

I have a custom model with NER and textcat components in the pipeline, using the en_vectors_web_lg base model. Now I want to add the parser and tagger to the pipeline. I extracted them from the en_core_web_lg model and added them to my pipeline. But when I save the model and load it back, I get this error:

OSError: [E050] Can't find model 'en_core_web_lg.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Any recommendations for adding a pre-trained parser and tagger to my custom model?

I think the problem here is that the vectors are inconsistent – if some components were trained with vectors A and the other components were trained with vectors B, combining them wouldn't work, because the components trained with vectors A would produce very different and potentially completely useless predictions with vectors B, and vice versa. So the easiest solution would be to retrain your NER and textcat components using the same vectors as the other components.
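For example, with the spaCy v2 train CLI you can point the retraining run at the vectors the core components were trained on (the paths here are placeholders, and the exact flags depend on your spaCy version):

python -m spacy train en /output /train.json /dev.json --vectors en_core_web_lg --pipeline ner,textcat

If you're training through Prodigy instead, the equivalent would be to use en_core_web_lg rather than en_vectors_web_lg as the base model.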

Is there any difference between the tagger and parser in en_core_web_lg and en_core_web_sm?
I just tried with the small version and did not get any errors.

The en_core_web_sm model doesn't use any word vectors – the en_core_web_lg model does, and all components were trained with word vectors as features. That's what makes them more accurate overall. But it also means you can't just take the vectors away.