Thanks for the report! The problem here is that `terms.train-vectors` adds a new `merge_entities` component to the pipeline, which is later added to the model's `meta.json`. So when you load the model back in, spaCy tries to find a factory for that component to initialise it (just like it does for the `'tagger'` or `'parser'`).
Sorry about that – the way this is currently handled isn't ideal, and we need to go back and think about how best to solve it. For now, you can simply remove the `'merge_entities'` component from the `"pipeline"` setting of your model's `meta.json` and add the component manually after loading the model:
```python
import spacy
from prodigy.components.preprocess import merge_entities

nlp = spacy.load('your_model')
nlp.add_pipe(merge_entities, name='merge_entities')
```
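For reference, the relevant part of `meta.json` looks something like this (the other component names here are just examples) – removing `"merge_entities"` from the list is what stops spaCy from looking for its factory on load:

```json
{
    "pipeline": ["tagger", "parser", "ner", "merge_entities"]
}
```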
This ensures that the entities are merged so the vectors you’ve trained for the merged entities are available. Here’s the function for reference:
```python
def merge_entities(doc):
    """Preprocess a spaCy doc, merging entities into a single token.
    Best used with nlp.add_pipe(merge_entities).

    doc (spacy.tokens.Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.
    """
    spans = [(e.start_char, e.end_char, e.root.tag, e.root.dep, e.label)
             for e in doc.ents]
    for start, end, tag, dep, ent_type in spans:
        doc.merge(start, end, tag=tag, dep=dep, ent_type=ent_type)
    return doc
```
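If it helps to see the idea in isolation, here's a minimal plain-Python sketch of what "merging spans into single tokens" means (no spaCy involved – `merge_spans` and the token-index spans are purely illustrative):

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) token-index span into a single token.

    tokens: list of token strings.
    spans: list of (start, end) index pairs, end exclusive, non-overlapping.
    """
    span_starts = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in span_starts:
            # Collapse the whole span into one token.
            end = span_starts[i]
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_spans(["I", "like", "New", "York", "City"], [(2, 5)]))
# → ['I', 'like', 'New York City']
```

After merging, a multi-word entity like "New York City" behaves as a single token – which is exactly why the vectors trained for merged entities can be looked up.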
Alternatively, you could also package your model using the `spacy package` command and add an entry to `Language.factories` that initialises the pipeline component – my comments on this thread have more details on this solution.
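To give a rough idea of why the factory entry fixes the loading problem, here's a toy sketch of the pattern (plain Python – the registry, decorator and component below are illustrative, not spaCy's actual internals; in spaCy v2 you'd assign to `Language.factories['merge_entities']` in your package's `__init__.py`):

```python
# Toy factory registry: meta.json stores only component *names*; on load,
# each name is looked up in the registry to recreate the component.
# A missing entry produces the kind of error you're seeing.
factories = {}

def component(name):
    """Decorator that registers a factory under the given name."""
    def register(factory):
        factories[name] = factory
        return factory
    return register

@component('merge_entities')
def create_merge_entities(nlp, **cfg):
    # In spaCy this would return the real merge_entities function.
    def merge_entities(doc):
        return doc
    return merge_entities

def load_pipeline(names, nlp=None):
    """Recreate pipeline components from the names stored in meta.json."""
    try:
        return [factories[name](nlp) for name in names]
    except KeyError as err:
        raise ValueError("Can't find factory for %r" % err.args[0])
```

With the factory registered, loading a pipeline that lists `'merge_entities'` succeeds; without it, you get the "can't find factory" error.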