Error while running terms.teach (E018)

I was following along with this video, trying to build my own set of terms for a slightly different vector space. I did everything the same way, but when I ran the following line:

prodigy terms.teach symptoms_seeds en_vectors_web_lg --seeds starter_symptoms.txt

I get the following error output:

ℹ Initializing with 8 seed terms from starter_symptoms.txt
Traceback (most recent call last):
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 300, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/prodigy/recipes/terms.py", line 58, in teach
    nlp.vocab[s]
  File "vocab.pyx", line 249, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 47, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 166, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '4035656307355538346'. This usually refers to an issue with the `Vocab` or `StringStore`."

I'm not quite sure how to fix this. Any pointers? Am I doing something wrong? I'm doing exactly what @ines did in the video...

Thank you!

You didn't do anything wrong. You were unlucky enough to hit one of the 4 (out of 1.1M) vectors with missing strings in this model. I noticed that a few were missing when I repackaged it for spaCy v2.3.0, but I decided to leave those vectors in (keeping it identical to previous versions), since in spaCy you usually start with a text before looking up vectors, so you'd never notice that the strings were missing. Only with some of the vector similarity methods do you go from vectors to words instead of the other way around.
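To make that concrete, here's a minimal sketch (not part of the fix) of the two lookup directions; "fever" is just a hypothetical seed term, not necessarily one of the affected entries:

import spacy

nlp = spacy.load("en_vectors_web_lg")

# text -> vector: the usual direction, which never hits the bug
print(nlp.vocab["fever"].vector[:3])

# vector key -> string: what the similarity search in terms.teach does,
# and it fails for the handful of keys whose strings are missing
for key in list(nlp.vocab.vectors.key2row)[:5]:
    try:
        print(key, nlp.vocab.strings[key])
    except KeyError:
        print(key, "-> no string stored (the E018 case)")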

But here it's a real problem, and you need a version of this model that doesn't contain those vectors. Let's see, I think this is the easiest way:

import spacy

nlp = spacy.load("en_vectors_web_lg")

# Drop every vector entry whose key (hash) can't be resolved back to a string
for key in list(nlp.vocab.vectors.key2row):
    try:
        nlp.vocab.strings[key]  # raises KeyError for the missing strings
    except KeyError:
        del nlp.vocab.vectors.key2row[key]

nlp.to_disk("/path/to/mod_en_vectors_web_lg")

Then use the full path to the saved model (/path/to/mod_en_vectors_web_lg) as the model with terms.teach instead of en_vectors_web_lg. If you want to have it installed as a Python package with pip, you can modify the model name and the vectors name in meta.json and package it with spacy package.
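For example, something like this (assuming you renamed the model to en_vectors_web_lg_fixed in meta.json; the exact directory and file names depend on the name and version you put there):

python -m spacy package /path/to/mod_en_vectors_web_lg /path/to/packages
cd /path/to/packages/en_vectors_web_lg_fixed-2.3.0
python setup.py sdist
pip install dist/en_vectors_web_lg_fixed-2.3.0.tar.gz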

I'm a little hesitant to release a modified version of this model at this point because it's been used in so many different places over the years. We'll have to think about what makes sense here. Sorry you ran into this bug!


That's perfect. The solution worked fine. Thank you for this clear and helpful response!

Please will someone walk me through this solution step by step?

@adriane please can you walk me through this?

@jal I think the solution might be simpler than you think :slightly_smiling_face: The code snippet posted above is a standalone script you can run that saves out a modified version of the vectors model without any of the vectors whose strings are missing.

It saves the modified vectors model out to a path, and you can then use that path as the input model, instead of en_vectors_web_lg.
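To spell it out step by step (a sketch, assuming you save the snippet as fix_vectors.py and keep the dataset name and seeds file from the original post):

# 1. Save the Python snippet from above as fix_vectors.py, edit the output
#    path at the bottom if you like, and run it once:
python fix_vectors.py

# 2. Then point terms.teach at that saved path instead of en_vectors_web_lg:
prodigy terms.teach symptoms_seeds /path/to/mod_en_vectors_web_lg --seeds starter_symptoms.txt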

Thank you. It worked.

I was also unlucky enough to hit the hash '4035656307355538346' in my 7-term initialisation :slight_smile: Happy this post exists.
