Error while running terms.teach (E018)

I was following along with this video, trying to get my own set of terms for a slightly different vector space. I did everything the same, but when I ran the following line:

prodigy terms.teach symptoms_seeds en_vectors_web_lg --seeds starter_symptoms.txt

I got the following error output:

ℹ Initializing with 8 seed terms from starter_symptoms.txt
Traceback (most recent call last):
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 300, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/strickvl/opt/anaconda3/lib/python3.7/site-packages/prodigy/recipes/terms.py", line 58, in teach
    nlp.vocab[s]
  File "vocab.pyx", line 249, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 47, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 166, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '4035656307355538346'. This usually refers to an issue with the `Vocab` or `StringStore`."

I'm not quite sure how to fix this. Any pointers? Am I doing something wrong? I'm following along exactly as @ines did in the video...

Thank you!

You didn't do anything wrong. You were unlucky enough to hit one of the 4 (out of 1.1M) vectors with missing strings in this model. I noticed that a few were missing when I repackaged it for spaCy v2.3.0, but I decided to leave those vectors in (keeping it identical to previous versions), since in spaCy you usually start from a text before looking up vectors, so you'd never notice that the strings were missing. Only if you're using some of the vector similarity methods do you go from vectors to words instead of the other way around.
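To make that direction concrete, here's a minimal sketch (using the hash from your traceback) of the vectors-to-strings lookup that fails:

import spacy

nlp = spacy.load("en_vectors_web_lg")
# 4035656307355538346 is the hash from the traceback above: a vector entry
# exists for it, but its string was never stored, so this raises KeyError [E018]
print(nlp.vocab.strings[4035656307355538346])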

But this is pretty problematic for you here and you need a version of this model that doesn't contain these vectors. Let's see, I think this is the easiest way:

import spacy

# load the vectors model, drop any vector entries whose string is missing
# from the StringStore, and save the result to a new path
nlp = spacy.load("en_vectors_web_lg")
for key in list(nlp.vocab.vectors.key2row):
    try:
        nlp.vocab.strings[key]
    except KeyError:
        del nlp.vocab.vectors.key2row[key]
nlp.to_disk("/path/to/mod_en_vectors_web_lg")

Then use the full path to the saved model (/path/to/mod_en_vectors_web_lg) as the model argument for terms.teach instead of en_vectors_web_lg. If you want to have it installed as a Python package with pip, you can modify the model name and the vectors name in meta.json and package it using spacy package.
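For example, reusing the dataset and seeds file from your original command (adjust the path to wherever you saved the modified model):

prodigy terms.teach symptoms_seeds /path/to/mod_en_vectors_web_lg --seeds starter_symptoms.txt

And if you do want an installable package, something like spacy package /path/to/mod_en_vectors_web_lg /output/dir (with /output/dir as a placeholder) will build one from the modified directory after you've edited meta.json.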

I'm a little hesitant to release a modified version of this model at this point because it's been used in so many different places over the years. We'll have to think about what makes sense here. Sorry you ran into this bug!


That is perfect. The solution worked fine. Thank you for this clear and helpful response!

Please will someone walk me through this solution step by step?

@adriane please can you walk me through this?

@jal I think the solution might be simpler than you think :slightly_smiling_face: The code snippet posted above is a standalone script you can run that saves out a modified version of the vectors model that doesn't include any of the missing strings.

It saves the modified vectors model out to a path, and you can then use that path as the input model, instead of en_vectors_web_lg.
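For example, if you save the snippet above to a file called fix_vectors.py (the filename and the dataset/seeds names below are just placeholders), the whole sequence is roughly:

python fix_vectors.py
prodigy terms.teach your_dataset /path/to/mod_en_vectors_web_lg --seeds your_seeds.txt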

Thank you. It worked.

I was also unlucky enough to hit the hash '4035656307355538346' in my 7-term initialisation :slight_smile: Happy this post exists.


Hi, I was able to create a modified model with the script suggested above and saved it in the site-packages directory of my virtualenv, where the original model lives too. However, I don't seem to be able to load the model: I've tried using just the model name in the command, as well as the relative and absolute paths to it, and they all result in OSError: [E050] Can't find model.

Here's the (slightly edited) error log:

(prodvenv3.8) ➜  Prodigy prodigy terms.teach provision_dataset mod_en_core_web_lg --seeds "seedterm1, seedterm2.."
ℹ Initializing with 10 seed terms
seedterm1, seedterm2..
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.8/3.8.8_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.8/3.8.8_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/prodigy/__main__.py", line 53, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 321, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/prodigy/recipes/terms.py", line 55, in teach
    nlp = spacy.load(vectors)
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/Users/michele/devel/Prodigy/prodvenv3.8/lib/python3.8/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'mod_en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Any hints would be welcome. I'm slightly troubled that it seems to be calling a version of Python in /Cellar, but maybe that's okay?

You shouldn't have to put the model in the site-packages (where Python expects to find installed packages) – you can also just load it from a path. So instead of loading en_core_web_lg, you'd load from /path/to/mod_en_core_web_lg.
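In your case that would be something like (with the path adjusted to wherever you saved the modified model):

prodigy terms.teach provision_dataset /path/to/mod_en_core_web_lg --seeds "seedterm1, seedterm2.."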

Thank you Ines – moving the model into the main directory I'm running Prodigy from and giving the absolute path in the command rather than the relative one fixed it for me.

Hi, I ran into the same kind of error when using sense2vec.teach, but the previous solution for terms.teach doesn't seem to work for it:

I first set up my nlp from en_core_web_lg, then added the s2v_reddit_2019_lg component that I downloaded from your website:

import spacy
from sense2vec import Sense2Vec, Sense2VecComponent

nlp = spacy.load("en_core_web_lg")
s2v = Sense2VecComponent(nlp.vocab).from_disk("../Prodigy_anotation/s2v_reddit_2019_lg")

Then I ran the sense2vec.teach command:

prodigy sense2vec.teach termsg06f "../Prodigy_anotation/s2v_reddit_2019_lg"
--seeds "electronic device user interface, control unit,...(about 6000 terms)"

And it returned the following error:

Traceback (most recent call last):
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 53, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 353, in prodigy.core._components_to_ctrl
  File "cython_src/prodigy/core.pyx", line 142, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 155, in prodigy.components.feeds.SharedFeed.validate_stream
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/site-packages/toolz/itertoolz.py", line 376, in first
    return next(iter(seq))
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/site-packages/sense2vec/prodigy_recipes.py", line 113, in get_stream
    most_similar = s2v.most_similar(accept_keys, n=n_similar)
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/site-packages/sense2vec/sense2vec.py", line 232, in most_similar
    result = [(self.strings[key], score) for key, score in result if key]
  File "/Users/zuoyou/opt/anaconda3/lib/python3.7/site-packages/sense2vec/sense2vec.py", line 232, in <listcomp>
    result = [(self.strings[key], score) for key, score in result if key]
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '12141139319163549496'. This usually refers to an issue with the `Vocab` or `StringStore`."

I tried the previous method to remove the missing keys, but when I tried to save the new nlp to my current directory, it returned this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-47993abdb9e6> in <module>
----> 1 nlp.to_disk("../Prodigy_anotation/s2v_reddit_2019_lg_fixed/")

~/opt/anaconda3/lib/python3.7/site-packages/spacy/language.py in to_disk(self, path, exclude, disable)
925             serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
926         serializers["vocab"] = lambda p: self.vocab.to_disk(p)
--> 927         util.to_disk(path, serializers, exclude)
928 
929     def from_disk(self, path, exclude=tuple(), disable=None):

~/opt/anaconda3/lib/python3.7/site-packages/spacy/util.py in to_disk(path, writers, exclude)
679         # Split to support file names like meta.json
680         if key.split(".")[0] not in exclude:
--> 681             writer(path / key)
682     return path
683 

~/opt/anaconda3/lib/python3.7/site-packages/spacy/language.py in <lambda>(p, proc)
923             if not hasattr(proc, "to_disk"):
924                 continue
--> 925             serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
926         serializers["vocab"] = lambda p: self.vocab.to_disk(p)
927         util.to_disk(path, serializers, exclude)

TypeError: to_disk() got an unexpected keyword argument 'exclude'

I don't understand why this happens – could you please help me check it? Thank you.

Hi! You shouldn't have to load any vectors in Python – to use the sense2vec.teach recipe, you'll only need to provide the path to the sense2vec vectors you downloaded. The workaround for the regular spaCy word vectors described in this thread is a bit different, and it's not going to work for the standalone sense2vec vectors. The underlying problem might also be a very different one.

This isn't directly related to the issue, but 6,000 seed terms is a lot – likely way too many. It's very easy to end up with a much less useful target vector this way. Essentially, what the recipe does is this: it looks up the vectors for all your seed terms in the vectors table, computes the average of all of them, and then finds the n most similar entries in the table based on that target vector. You often only need a handful of terms to end up in the right vector space, so it's usually better to focus on a smaller set of the most relevant seed terms.
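As a rough sketch of that logic (not the actual recipe code – the seed terms and path here are just placeholders):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2019_lg")
seeds = ["electronic device", "control unit"]  # placeholder seed terms
# resolve each seed to its best sense key, then ask for the entries most
# similar to all of them – internally this averages the seed vectors
keys = [s2v.get_best_sense(seed) for seed in seeds]
keys = [key for key in keys if key is not None]
print(s2v.most_similar(keys, n=10))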

It'd be interesting to check which keys in your terms are not found in the StringStore. Maybe try running this for your seed terms and see which one it fails for:

from sense2vec import Sense2Vec

seeds = ["some term", "some other term"]  # your seeds here
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2019_lg")
for seed in seeds:
    key = s2v.get_best_sense(seed)
    if key is not None:
        print(key)
        # this will raise the KeyError [E018] if one of the similar
        # entries is missing its string
        s2v.most_similar([key])

Thanks a lot for your detailed explanation, Ines.

It's true that it works better with a smaller list of seeds. And it seems there was a problem with my downloaded s2v_reddit_2019_lg file – I downloaded it again and now it works.


4 posts were split to a new topic: E018 when fine-tuning parser