Merge Entities Error

After I trained a model using Merge Entities, I am getting the following error. Do I need to install anything?

/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
  """)
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/prodigy/__main__.py", line 248, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/prodigy/recipes/textcat.py", line 106, in batch_train
    nlp = spacy.load(input_model, disable=['ner'])
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 117, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 157, in load_model_from_path
    component = nlp.create_pipe(name, config=config)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/language.py", line 215, in create_pipe
    raise KeyError("Can't find factory for '{}'.".format(name))
KeyError: "Can't find factory for 'merge_entities'."

Thanks for the report! The problem here is that the terms.train-vectors recipe adds a new merge_entities component to the pipeline, which is later added to the model’s meta.json. So when you load the model back in, spaCy tries to find a factory for that component to initialise it (just like it does for the 'tagger' or 'parser').

Sorry about that. The way this is currently handled is not ideal, and we need to go back and think about how best to solve it. For now, you could simply remove the 'merge_entities' component from the "pipeline" setting of your model’s meta.json and add the component manually after loading the model:

import spacy
from prodigy.components.preprocess import merge_entities

nlp = spacy.load('your_model')
nlp.add_pipe(merge_entities, name='merge_entities')

This ensures that the entities are merged so the vectors you’ve trained for the merged entities are available. Here’s the function for reference:

def merge_entities(doc):
    """Preprocess a spaCy doc, merging entities into a single token.
    Best used with nlp.add_pipe(merge_entities).

    doc (spacy.tokens.Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun entities.
    """
    spans = [(e.start_char, e.end_char, e.root.tag, e.root.dep, e.label)
             for e in doc.ents]
    for start, end, tag, dep, ent_type in spans:
        doc.merge(start, end, tag=tag, dep=dep, ent_type=ent_type)
    return doc

Alternatively, you could also package your model using the spacy package command and add an entry to Language.factories that initialises the pipeline component – my comments on this thread have more details on this solution.
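
For the first workaround, the meta.json edit can also be scripted. This is only a minimal sketch using the standard library: the helper name remove_pipeline_component is made up for illustration (it is not part of spaCy or Prodigy), and it assumes the model directory contains a meta.json with a "pipeline" list.

```python
import json
from pathlib import Path


def remove_pipeline_component(model_dir, name):
    """Remove `name` from the "pipeline" list in <model_dir>/meta.json.

    Hypothetical helper, not part of spaCy or Prodigy.
    Returns the updated pipeline list.
    """
    meta_path = Path(model_dir) / "meta.json"
    with meta_path.open("r", encoding="utf8") as f:
        meta = json.load(f)
    # Keep every component except the one that has no registered factory.
    meta["pipeline"] = [p for p in meta.get("pipeline", []) if p != name]
    with meta_path.open("w", encoding="utf8") as f:
        json.dump(meta, f, indent=2)
    return meta["pipeline"]
```

After calling something like remove_pipeline_component('your_model', 'merge_entities'), you would load the model and re-add the component with nlp.add_pipe as described above.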

What would be the method for merge_noun_chunks?

raise KeyError("Can't find factory for '{}'.".format(name))
KeyError: "Can't find factory for 'merge_noun_chunks'."

Sorry, I should have added that one as well. It’s also a preprocessor, so you can import it and add the Prodigy component to your pipeline:

import spacy
from prodigy.components.preprocess import merge_noun_chunks

nlp = spacy.load('your_model')
nlp.add_pipe(merge_noun_chunks, name='merge_noun_chunks')

Or use the function instead:

def merge_noun_chunks(doc):
    """Preprocess a spaCy Doc, merging noun chunks. Best used with
    nlp.add_pipe(merge_noun_chunks).

    doc (spacy.tokens.Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.
    """
    if not doc.is_parsed:
        # A pipeline component must always return the Doc
        return doc
    spans = [(np.start_char, np.end_char, np.root.tag, np.root.dep)
             for np in doc.noun_chunks]
    for start, end, tag, dep in spans:
        doc.merge(start, end, tag=tag, dep=dep)
    return doc

from prodigy.components.preprocess import merge_noun_chunks
from prodigy.components.preprocess import merge_entities

nlp = spacy.load("/Users/philips/Development/BigData/RS/annotation/Prodigy/Classification_Model/followup_recommendation_radreportw2veconly/")
nlp.add_pipe(merge_noun_chunks, name='merge_noun_chunks')
nlp.add_pipe(merge_entities, name='merge_entities')

I get the following error:


AttributeError                            Traceback (most recent call last)
in <module>()
      2 from prodigy.components.preprocess import merge_entities
      3
----> 4 nlp = spacy.load("/Users/philips/Development/BigData/RS/annotation/Prodigy/Classification_Model/followup_recommendation_radreportw2veconly/")
      5 nlp.add_pipe(merge_noun_chunks, name='merge_noun_chunks')
      6 nlp.add_pipe(merge_entities, name='merge_entities')

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/__init__.py in load(name, **overrides)
     17 "to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
     18 'error')
---> 19 return util.load_model(name, **overrides)
     20
     21

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/util.py in load_model(name, **overrides)
    115 return load_model_from_package(name, **overrides)
    116 if Path(name).exists(): # path to model data directory
--> 117 return load_model_from_path(Path(name), **overrides)
    118 elif hasattr(name, 'exists'): # Path or Path-like to model data
    119 return load_model_from_path(name, **overrides)

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides)
    157 component = nlp.create_pipe(name, config=config)
    158 nlp.add_pipe(component, name=name)
--> 159 return nlp.from_disk(model_path)
    160
    161

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/language.py in from_disk(self, path, disable)
    636 if not (path / 'vocab').exists():
    637 exclude['vocab'] = True
--> 638 util.from_disk(path, deserializers, exclude)
    639 self._path = path
    640 return self

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/util.py in from_disk(path, readers, exclude)
    520 for key, reader in readers.items():
    521 if key not in exclude:
--> 522 reader(path / key)
    523 return path
    524

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/spacy/language.py in <lambda>(p, proc)
    632 if not hasattr(proc, 'to_disk'):
    633 continue
--> 634 deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
    635 exclude = {p: False for p in disable}
    636 if not (path / 'vocab').exists():

nn_parser.pyx in spacy.syntax.nn_parser.Parser.from_disk()

~/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/model.py in from_bytes(self, bytes_data)
    349 if isinstance(name, bytes):
    350 name = name.decode('utf8')
--> 351 dest = getattr(layer, name)
    352 copy_array(dest, param[b'value'])
    353 i += 1

AttributeError: 'FunctionLayer' object has no attribute 'vectors'

Meta.json

{
  "license": "CC BY-SA 3.0",
  "url": "https://explosion.ai",
  "lang": "en",
  "sources": [
    "OntoNotes 5",
    "Common Crawl"
  ],
  "name": "core_web_sm",
  "pipeline": [
    "tagger",
    "parser",
    "textcat"
  ],
  "version": "2.0.0",
  "description": "English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.",
  "email": "contact@explosion.ai",
  "speed": {
    "gpu": null,
    "nwords": 291344,
    "cpu": 5122.3040471407
  },
  "parent_package": "spacy",
  "spacy_version": ">=2.0.0a18",
  "author": "Explosion AI",
  "accuracy": {
    "uas": 91.7237657538,
    "ents_f": 85.2975560875,
    "ents_r": 85.6312524451,
    "ents_p": 84.9664503965,
    "tags_acc": 97.0403350292,
    "las": 89.800872413,
    "token_acc": 99.8698372794
  },
  "vectors": {
    "vectors": 569319,
    "width": 300,
    "keys": 503161
  }
}

Hi!

I’m not sure if I should start a new topic or post my (what I think is related) question here? But here goes (sorry in advance if it should be a separate topic):

I have trained a Danish word2vec model on 2.2 million posts from Facebook pages belonging to Danish media sites and politicians, with the intent of building a topic classifier (inspired by @ines’s video tutorial on how to train an insult classifier). I trained the model using the terms.train-vectors recipe with the merge entities flag. However, I get a similar error when using the terms.teach recipe with the trained model.

If I remove the merge_entities component from the meta.json everything works fine, but obviously the merge_entities component is not used.

Is it possible to modify the terms.teach recipe so that it includes the merge_entities component?

Thanks!

@ronnie Sorry if this was confusing and frustrating – we hadn’t thought this through from end to end, so there’s currently an awkward gap here. But the next update to spaCy will include factories for both merge_entities and merge_noun_chunks out of the box. This means that when you load your model and the pipeline specifies one of those components, spaCy will know what to do. (We’re actually just working on that!)

In the meantime, the simplest fix would be to remove the 'merge_entities' from your meta.json and re-add the function manually. From within a Prodigy recipe, you can also just import the component as prodigy.components.preprocess.merge_entities.

import spacy

def merge_entities(doc):
    spans = [(e.start_char, e.end_char, e.root.tag, e.root.dep, e.label)
             for e in doc.ents]
    for start, end, tag, dep, ent_type in spans:
        doc.merge(start, end, tag=tag, dep=dep, ent_type=ent_type)
    return doc

nlp = spacy.load('/path/to/your/model')
nlp.add_pipe(merge_entities, name='merge_entities', after='ner')

The above solution still means you have to do this manually after loading the model. A more elegant solution would be to include the component in your model’s __init__.py and then add a factory to Language that lets spaCy initialise your component. My comment on this thread has more details on this.

from spacy.language import Language

def entity_merger(nlp, **cfg):
    return merge_entities

Language.factories['merge_entities'] = lambda nlp, **cfg: entity_merger(nlp, **cfg)

You can then package your model with spacy package (this is important, because you want spaCy to execute the package and its __init__.py!) and it will be able to load the merge_entities component. However, since spaCy will be providing a built-in factory for this, you hopefully won’t have to implement this yourself! (It might be useful in the future, though, if you ever end up writing more complex custom components.)
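
To make the lookup concrete, here is a toy model of what happens at load time. It is plain Python and not spaCy's actual code: FACTORIES, build_pipeline and the fake nlp object are stand-ins invented for illustration. It only shows why a name listed in meta.json needs a registered factory before the model is loaded.

```python
# Toy stand-in for Language.factories: maps a component name to a factory.
FACTORIES = {}


def merge_entities(doc):
    # Placeholder component; the real one merges entity spans.
    return doc


def build_pipeline(nlp, pipeline_names):
    """Create one component per name, as a loader would for meta.json."""
    components = []
    for name in pipeline_names:
        if name not in FACTORIES:
            raise KeyError("Can't find factory for '{}'.".format(name))
        # A factory takes the nlp object (plus config) and returns the component.
        components.append((name, FACTORIES[name](nlp)))
    return components


nlp = object()  # stand-in for the loaded Language object

try:
    build_pipeline(nlp, ["merge_entities"])
except KeyError as err:
    print("Before registering:", err)

# Register the factory first; building the pipeline then succeeds.
FACTORIES["merge_entities"] = lambda nlp, **cfg: merge_entities
pipeline = build_pipeline(nlp, ["merge_entities"])
print("After registering:", [name for name, _ in pipeline])
```

The same ordering applies to the real thing: the factory has to exist before spacy.load reads the pipeline names, which is why putting it in the package's __init__.py works.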

Quick update: The following commit adds merge_entities and merge_noun_chunks as built-in factories, so spaCy will be able to create and add them if they're present in a model's meta.json, without requiring custom modifications. The fix will be included in the next spaCy release.

Just released spaCy v2.0.10, which includes built-in factories for merge_entities and merge_noun_chunks. The new version is compatible with Prodigy v1.4.0, so you should be able to run the following in your Prodigy environment:

pip install -U spacy

I know this thread hasn't been active for a while, but I think this might be a good spot for a question I have that I think is similar. Apologies if not.

I have built a custom pipeline component that retokenizes text based on a regex rule. So "Canon EOS 450D" is now one token as opposed to 3. When I load the model the component works fine. I am trying to use the resulting model with the prodigy ner.teach recipe. But I get the following trace:

KeyError: "[E002] Can't find factory for 'tokenizer'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['tokenizer']` or remove it from the model meta and add it via `nlp.add_pipe` instead."

Here is my code for the component:

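import re
from spacy.language import Language

def tokenizer(doc):
    # PATTERN is the regex string, defined elsewhere
    matches = re.compile(PATTERN, re.IGNORECASE).finditer(doc.text)
    with doc.retokenize() as retokenizer:
        for match in matches:
            start, end = match.span()
            span = doc.char_span(start, end)
            # char_span returns None if the match doesn't align with token boundaries
            if span is not None:
                retokenizer.merge(span)
    return doc

Language.factories['tokenizer'] = tokenizer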

Not sure if I have understood how to write to factories. The examples in this thread have components that are callable classes with a language object as an initialising param. Should I refactor my code along that pattern, or is there a way to stay with components that are functions?

If I do remove the tokenizer from the meta.json, and then build the model up in a script using nlp.add_pipe (which I think is the other possible solution), then use the resulting model in a custom recipe, I am not sure how to get my custom recipe to give me the benefits of active learning that I get when I use ner.teach from the command line.

Any advice would be much appreciated.

(btw if my example seems contrived I apologise. It maps exactly to what I am doing, but I have changed the tokenizer name to conceal the nature of the data I am working with which is proprietary to my client)

By the way, I just tried the following:

def build_model(model_dir=None):

    nlp = spacy.load('en_core_web_lg')

    nlp.add_pipe(tokenizer, name='tokenizer', first=True)
    nlp.factories['tokenizer'] = tokenizer
    
    if model_dir:
        nlp.to_disk(model_dir)
    return nlp

But no joy there either. Thanks

Hi! Your general approach sounds good, but there are a few problems here:

  1. The tokenizer factory is actually a built-in factory already, so you should be calling your component something else. Maybe token_merger or something like that. Otherwise, you're overwriting the factory for the actual tokenizer that takes a text and turns it into a doc.

  2. Factories are functions that take the nlp object and optional config parameters and return the initialized component. You can see some examples in the built-in factories. Making them functions that return the component function can be useful if your custom component is a class that needs to be initialized with the shared vocab or any other state. In your case, it's just a regular function, so your factory could look like this:

Language.factories["token_merger"] = lambda nlp, **cfg: token_merger

  3. The factory needs to be registered before loading the model. It tells spaCy what to do when it comes across a string name like "token_merger" in your model's meta.json. The hacky way to test it would be to just copy-paste your component code into the recipe. The proper, elegant way would be to use spacy package to turn your model into a Python package and then add your code to the model package's __init__.py. You can find more details in the docs here.

  4. Finally, your component is a bit special, because it changes the tokenization. If you want that custom tokenization reflected during annotation, it might be a better idea to include it in a custom nlp.make_doc method. This runs before the regular pipeline, and Prodigy may also refer to nlp.make_doc, expecting it to return the final tokenized Doc. Here's an example of how to do this in your model package's __init__.py:

# In your model package's __init__.py
from spacy.util import load_model_from_init_py

# token_merger is your custom component function, defined in this file

def load(**overrides):
    nlp = load_model_from_init_py(__file__, **overrides)

    def custom_make_doc(text):
        doc = nlp.tokenizer(text)
        doc = token_merger(doc)
        return doc

    nlp.make_doc = custom_make_doc
    return nlp
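
To get a feel for what the combined tokenize-then-merge step returns, here is a plain-Python stand-in that runs without a model. It is only a sketch: whitespace splitting replaces spaCy's tokenizer, the PATTERN regex and all function names are made up for illustration, and matches that don't line up with token boundaries are skipped (in spaCy, doc.char_span returns None in that case).

```python
import re

# Hypothetical pattern for illustration: "Canon EOS" followed by a model code.
PATTERN = re.compile(r"canon eos \w+", re.IGNORECASE)


def tokenize(text):
    """Whitespace tokenizer returning (start, end, text) for each token."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def merge_matches(text, tokens, pattern=PATTERN):
    """Merge every run of tokens that is exactly covered by a regex match."""
    starts = {start: i for i, (start, _, _) in enumerate(tokens)}
    ends = {end: i for i, (_, end, _) in enumerate(tokens)}
    merged, pos = [], 0
    for match in pattern.finditer(text):
        first, last = starts.get(match.start()), ends.get(match.end())
        if first is None or last is None or first < pos:
            continue  # match is not aligned with token boundaries
        merged.extend(tok for _, _, tok in tokens[pos:first])
        merged.append(text[match.start():match.end()])
        pos = last + 1
    merged.extend(tok for _, _, tok in tokens[pos:])
    return merged


def make_doc(text):
    """Tokenize first, then merge, mirroring the custom make_doc idea."""
    return merge_matches(text, tokenize(text))


print(make_doc("I bought a Canon EOS 450D yesterday"))
```

Doing the merge inside make_doc means everything downstream, including whatever Prodigy shows during annotation, sees the merged tokens.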

Hey ines thanks for the reply.

Getting closer. In a script before I load a model with a custom pipeline I register the component with the Language class and I am now able to reload my tweaked model. Code something like this:

def load_model(model_dir):
    Language.factories['token_merger'] = lambda nlp, **cfg: token_merger
    nlp = spacy.load(model_dir)
    return nlp

I guess I am still wrestling with how to get Prodigy to use the model from the CLI so that I can take advantage of the active learning. (Can you sketch, or direct me to docs that show, how to implement active learning in custom recipes?)

I am thinking that in spaCy's idiom the better (less hacky) way to proceed is to use my regular expressions in a pattern.jsonl file, let Prodigy learn to identify multi-token entities, then merge such entities when the Prodigy-trained model identifies them. The reason I resisted this is that my instinct is to take advantage of domain knowledge and send coherent semantic units to the learner, giving Prodigy one less pattern to learn.

Although I suppose the pattern.jsonl does just that.

What you pass in as the model on the command line is passed directly to spacy.load. If you want to execute custom code when the model is loaded, one way would be to use spacy package, add your custom merger to the __init__.py (as described in my previous comment) and then pip install your custom model in the same environment as Prodigy.

However, if you just want to hack around and test things, you can also just find the location of your Prodigy installation and edit the recipes/ner.py. Once you know it all works, you can still make it more elegant. To find the location of your installation, you can use the following one-liner:

python -c "import prodigy; print(prodigy.__file__)"

You can find details on Prodigy's built-in sorters, which perform the example selection (given a stream of (score, example) tuples), in the "Sorters" section of your PRODIGY_README.html. Another good place to start is the prodigy-recipes repo, which includes simplified and explained versions of various recipes: https://github.com/explosion/prodigy-recipes
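
As a rough illustration of that idea, here is a toy sorter in plain Python. It is not Prodigy's implementation (the README describes how the built-in sorters actually behave), and the name toy_prefer_uncertain and the margin parameter are invented. It takes a stream of (score, example) tuples and only passes on the examples whose score is close to 0.5, i.e. the ones the model is least sure about.

```python
def toy_prefer_uncertain(scored_stream, margin=0.2):
    """Yield examples whose score lies within `margin` of 0.5.

    scored_stream: iterable of (score, example) tuples, score in [0, 1].
    Hypothetical simplification, not Prodigy's sorter.
    """
    for score, example in scored_stream:
        if abs(score - 0.5) <= margin:
            yield example


stream = [
    (0.98, {"text": "clearly positive"}),
    (0.55, {"text": "borderline"}),
    (0.03, {"text": "clearly negative"}),
    (0.40, {"text": "also unsure"}),
]
for eg in toy_prefer_uncertain(stream):
    print(eg["text"])
```

In a custom recipe, the scores would come from your model's predictions on the incoming stream, and the filtered generator is what you would return as the stream to annotate.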

Oh, in that case: Yes, try just skipping the merging! The named entity recognition model is designed to predict sequences of tokens, so it might do just fine without it. But it's hard to say, maybe training on merged tokens does improve accuracy... definitely worth an experiment!
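
If you go the patterns route, the file is plain JSONL with one pattern per line. Here is a minimal sketch of writing one with the standard library. The PRODUCT label, the example terms and the write_patterns helper are placeholders, and note that these are token-based match patterns (or exact strings) rather than regular expressions, so a regex would have to be expressed in terms of token attributes.

```python
import json

# Placeholder label and terms; each entry becomes one line of the JSONL file.
patterns = [
    # Token-based pattern: three tokens matched on their lowercase form.
    {"label": "PRODUCT", "pattern": [{"lower": "canon"}, {"lower": "eos"}, {"lower": "450d"}]},
    # Exact string match.
    {"label": "PRODUCT", "pattern": "Canon EOS 450D"},
]


def write_patterns(path, patterns):
    """Write one JSON object per line and return the number of lines written."""
    with open(path, "w", encoding="utf8") as f:
        for entry in patterns:
            f.write(json.dumps(entry) + "\n")
    return len(patterns)
```

The resulting file can then be passed to ner.teach via its --patterns argument.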