E018 when fine-tuning parser

I am trying to fine-tune the parser component of en_core_web_trf, using the same train and test datasets in JSONL format that worked fine with spaCy v2 and Prodigy 1.10. I keep getting the same error:

"KeyError: "[E018] Can't retrieve string for hash '14000015214052600094'. This usually refers to an issue with the Vocab or StringStore."

Spacy Version: 3.1.1
Prodigy Version: 1.11.2
OS: macOS Big Sur

I tried the workaround below, but I still receive the same error message:

import spacy

nlp = spacy.load("en_core_web_trf")
# Drop vector keys whose strings can't be resolved in the StringStore
for key in list(nlp.vocab.vectors.key2row):
    try:
        word = nlp.vocab.strings[key]
    except KeyError:
        del nlp.vocab.vectors.key2row[key]
nlp.to_disk("/path/to/mod_en_core_web_trf")

When I use "/path/to/mod_en_core_web_trf" as the base model, I get the same error.

I also installed en_vectors_web_lg, and when I try to load it, I get the following error message:

OSError: [E053] Could not read config.cfg from /Users/atakanince/groupsolver_env/lib/python3.8/site-packages/en_vectors_web_lg/en_vectors_web_lg-2.3.0/config.cfg

Help is much appreciated.

Thanks in advance,

-Atakan

Could you share the full traceback of where this error occurs under the hood?

The workaround you're using doesn't really work here because the en_core_web_trf pipeline doesn't include any word vectors. If you want to use word vectors, you can download the en_core_web_lg pipeline (we no longer ship a separate vectors-only package for spaCy v3, since it would be redundant).
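A quick way to check this yourself is to look at the shape of a pipeline's vectors table. A minimal sketch (using a blank English pipeline here so the snippet runs without a model download; a loaded en_core_web_trf reports 0 rows the same way, while en_core_web_lg reports a non-zero row count):

```python
import spacy

# A blank pipeline has an empty vectors table, just like en_core_web_trf.
# A vector-bearing package such as en_core_web_lg would report a
# non-zero number of rows here.
nlp = spacy.blank("en")
n_vectors, width = nlp.vocab.vectors.shape
print(n_vectors, width)
```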

Thank you, Ines. Below is the command I ran with the trf model and the full traceback of the error message:

(spacy_venv) jupyter@research-experiments-test-2:~/spacy$ python -m prodigy train --gpu-id 0 -l en --parser train_dependency,eval:test_dependency --base-model en_core_web_trf ./spacy3_trf_tuned_parser_test | tee logs_trf_parser_test.txt
[2021-09-02 17:35:49,015] [INFO] Set up nlp object from config
Components: parser
Merging training and evaluation data for 1 components

  • [parser] Training: 627 | Evaluation: 100 (from datasets)
    Training: 618 | Evaluation: 100
    Labels: parser (42)
    [2021-09-02 17:35:49,270] [INFO] Pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
    [2021-09-02 17:35:49,270] [INFO] Resuming training for: ['parser', 'transformer']
    [2021-09-02 17:35:49,276] [INFO] Created vocabulary
    [2021-09-02 17:35:49,278] [INFO] Finished initializing nlp object
    [2021-09-02 17:35:49,278] [INFO] Initialized pipeline components: []
    Components: parser
    Merging training and evaluation data for 1 components
  • [parser] Training: 627 | Evaluation: 100 (from datasets)
    Training: 618 | Evaluation: 100
    Labels: parser (42)
    ℹ Using GPU: 0
    ========================= Generating Prodigy config =========================
    ℹ Auto-generating config with spaCy
    ℹ Using config from base model
    ✔ Generated training config
    =========================== Initializing pipeline ===========================
    ✔ Initialized pipeline
    ============================= Training pipeline =============================
    ℹ Pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler',
    'lemmatizer', 'ner']
    ℹ Frozen components: ['tagger', 'attribute_ruler', 'lemmatizer',
    'ner']
    ℹ Initial learn rate: 0.0
    E # LOSS TRANS... LOSS PARSER DEP_UAS DEP_LAS SENTS_F SCORE

0 0 31.85 10.91 90.57 89.25 95.57 0.90
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 327, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/recipes/train.py", line 283, in train
    silent=silent,
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/recipes/train.py", line 197, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 209, in train_while_improving
    annotates=annotating_components,
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/language.py", line 1123, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 387, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/pipeline/transition_parser.pyx", line 638, in spacy.pipeline.transition_parser.Parser._init_gold_batch
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 649, in spacy.pipeline._parser_internals.arc_eager.ArcEager.init_gold
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 673, in spacy.pipeline._parser_internals.arc_eager.ArcEager._replace_unseen_labels
  File "spacy/strings.pyx", line 132, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '940378387113885398'. This usually refers to an issue with the Vocab or StringStore."
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E018] Can't retrieve string for hash '940378387113885398'. This
usually refers to an issue with the Vocab or StringStore.")

I also tried fine-tuning with en_core_web_lg and a modified en_core_web_lg, but I get the error message below:

(spacy_venv) jupyter@research-experiments-test-2:~/spacy$ python -m prodigy train --gpu-id 0 -l en --parser train_dependency,eval:test_dependency --base-model ./mod_en_core_web_lg ./spacy3_lg_tuned_parser_test | tee logs_lg_parser_test.txt
[2021-09-02 17:30:04,770] [INFO] Set up nlp object from config
Components: parser
Merging training and evaluation data for 1 components

  • [parser] Training: 627 | Evaluation: 100 (from datasets)
    Training: 618 | Evaluation: 100
    Labels: parser (42)
    [2021-09-02 17:30:05,134] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
    [2021-09-02 17:30:05,135] [INFO] Resuming training for: ['parser', 'tok2vec']
    [2021-09-02 17:30:05,141] [INFO] Created vocabulary
    [2021-09-02 17:30:07,121] [INFO] Added vectors: ./mod_en_core_web_lg
    [2021-09-02 17:30:09,089] [INFO] Finished initializing nlp object
    [2021-09-02 17:30:09,090] [INFO] Initialized pipeline components: []
    Components: parser
    Merging training and evaluation data for 1 components
  • [parser] Training: 627 | Evaluation: 100 (from datasets)
    Training: 618 | Evaluation: 100
    Labels: parser (42)
    ℹ Using GPU: 0
    ========================= Generating Prodigy config =========================
    ℹ Auto-generating config with spaCy
    ℹ Using config from base model
    ✔ Generated training config
    =========================== Initializing pipeline ===========================
    ✔ Initialized pipeline
    ============================= Training pipeline =============================
    ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
    'lemmatizer', 'ner']
    ℹ Frozen components: ['tagger', 'attribute_ruler', 'lemmatizer',
    'ner']
    ℹ Initial learn rate: 0.001
    E # LOSS TOK2VEC LOSS PARSER DEP_UAS DEP_LAS SENTS_F SCORE

0 0 47.48 9.06 87.69 86.48 100.00 0.88
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 327, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/recipes/train.py", line 283, in train
    silent=silent,
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/prodigy/recipes/train.py", line 197, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/training/loop.py", line 209, in train_while_improving
    annotates=annotating_components,
  File "/home/jupyter/spacy/spacy_venv/lib/python3.7/site-packages/spacy/language.py", line 1123, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 387, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/pipeline/transition_parser.pyx", line 638, in spacy.pipeline.transition_parser.Parser._init_gold_batch
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 649, in spacy.pipeline._parser_internals.arc_eager.ArcEager.init_gold
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 673, in spacy.pipeline._parser_internals.arc_eager.ArcEager._replace_unseen_labels
  File "spacy/strings.pyx", line 132, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '940378387113885398'. This usually refers to an issue with the Vocab or StringStore."
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E018] Can't retrieve string for hash '940378387113885398'. This
usually refers to an issue with the Vocab or StringStore.")

Thanks for the details! I moved this to a new thread, since it seems to be a different problem.

From the error message, it sounds like you might have ended up with labels in your training data that aren't added to the model. If you run the training with --verbose, you should be able to see all the labels in the data, so you can check whether there's anything suspicious in there.

Sorry for the late response Ines! Since I could not find the other thread, I am writing here.

Here's what we did to fix the issue:

We cloned the spaCy repo from GitHub and kept editing the source code, recompiling the Cython files, and reinstalling the modified package until we understood what was going wrong under the hood.

Our first intuition was correct: missing strings were not being added to the StringStore, but the same also happens for missing labels. It turns out there is an intermediary step involving projectivized dependency labels, which have an unusual structure like attr||pobj, attr||, etc.

In the end, those labels were missing from the pretrained spaCy models' StringStores, and they were what triggered the hash error message.

We wrote all possible combinations to the StringStore of the base model en_core_web_trf, made sure each had a corresponding hash value, and started training with all components on the saved, modified version.
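To make that concrete, here's roughly what the combination step looked like as a sketch. The label list below is illustrative; on a real pipeline you'd take nlp.get_pipe("parser").labels from the base model and register each generated string with nlp.vocab.strings.add() before saving:

```python
import itertools

# Illustrative dependency labels; on a real pipeline use
# nlp.get_pipe("parser").labels instead.
labels = ["attr", "pobj", "dobj", "nsubj"]

# Projectivization rewrites a lifted arc's label as "head||child"
# (and sometimes just "label||"), so precompute every combination
# the parser might emit during training.
proj_labels = {f"{a}||{b}" for a, b in itertools.product(labels, repeat=2)}
proj_labels |= {f"{a}||" for a in labels}

# With a loaded pipeline you would then register each string, e.g.:
#   for s in proj_labels:
#       nlp.vocab.strings.add(s)
#   nlp.to_disk("./mod_en_core_web_trf")

print(len(proj_labels))  # 4*4 pairs + 4 bare forms = 20
```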

Thanks for your messages, as always.

Yours,
-Atakan


Hmm, the parser should add all the projectivized label strings to the StringStore, and we haven't run into this issue with spacy on its own, although it's rare for users to fine-tune the parser. I suspect there's an interaction with how prodigy loads the data for the training task. We'll take a look, since these additional steps definitely shouldn't be necessary.

Hello! I think I am running into a similar issue. I am trying to train from the rel.manual recipe for hypernyms and hyponyms.

I annotated via the following command:
prodigy rel.manual hypernym_NER en_core_web_lg "./datasets/hearst_hypernym_sentences_raw_text_handmade.txt" --label HYPER,HYPO,PATTERN --span-label HYPER,HYPO,PATTERN

(base) karl@karlkruncher:~/PycharmProjects/doctorlingo/test_scripts/cwi-master/CWI_Sequence_Labeller$ prodigy train "./NER_hypernym_model" --parser hypernym_NER --base-model en_core_web_lg --eval-split 0.1 --label-stats --gpu-id 0 -V
ℹ Using GPU: 0
/home/karl/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at Start Locally | PyTorch

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
[2021-09-29 15:08:42,987] [DEBUG] Replacing listeners of component 'tagger'
[2021-09-29 15:08:45,600] [INFO] Set up nlp object from config
Components: parser
Merging training and evaluation data for 1 components

  • [parser] Training: 27 | Evaluation: 2 (10% split)
    Training: 27 | Evaluation: 2
    Labels: parser (3)
  • [parser] HYPER, HYPO, PATTERN
    [2021-09-29 15:08:45,616] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
    [2021-09-29 15:08:45,616] [INFO] Resuming training for: ['parser', 'tok2vec']
    [2021-09-29 15:08:45,620] [INFO] Created vocabulary
    [2021-09-29 15:08:47,207] [INFO] Added vectors: en_core_web_lg
    [2021-09-29 15:08:48,637] [INFO] Finished initializing nlp object
    [2021-09-29 15:08:48,638] [INFO] Initialized pipeline components: []
    ✔ Initialized pipeline

============================= Training pipeline =============================
Components: parser
Merging training and evaluation data for 1 components

  • [parser] Training: 27 | Evaluation: 2 (10% split)
    Training: 27 | Evaluation: 2
    Labels: parser (3)
  • [parser] HYPER, HYPO, PATTERN
    ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
    'lemmatizer', 'ner']
    ℹ Frozen components: ['tagger', 'attribute_ruler', 'lemmatizer',
    'ner']
    ℹ Initial learn rate: 0.001
    E # LOSS TOK2VEC LOSS PARSER DEP_UAS DEP_LAS SENTS_F SCORE

[2021-09-29 15:08:48,652] [DEBUG] [W026] Unable to set all sentence boundaries from dependency parses. If you are constructing a parse tree incrementally by setting token.head values, you can probably ignore this warning. Consider using Doc(words, ..., heads=heads, deps=deps) instead.
[... the same W026 warning repeated 55 more times ...]
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E018] Can't retrieve string for hash '16588043228098313248'. This
usually refers to an issue with the Vocab or StringStore.")
Traceback (most recent call last):
  File "/home/karl/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/karl/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/karl/anaconda3/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/karl/anaconda3/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/karl/anaconda3/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/karl/anaconda3/lib/python3.8/site-packages/prodigy/recipes/train.py", line 277, in train
    return _train(
  File "/home/karl/anaconda3/lib/python3.8/site-packages/prodigy/recipes/train.py", line 197, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/home/karl/anaconda3/lib/python3.8/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/home/karl/anaconda3/lib/python3.8/site-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/home/karl/anaconda3/lib/python3.8/site-packages/spacy/training/loop.py", line 203, in train_while_improving
    nlp.update(
  File "/home/karl/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1122, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 387, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/pipeline/transition_parser.pyx", line 638, in spacy.pipeline.transition_parser.Parser._init_gold_batch
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 649, in spacy.pipeline._parser_internals.arc_eager.ArcEager.init_gold
  File "spacy/pipeline/_parser_internals/arc_eager.pyx", line 673, in spacy.pipeline._parser_internals.arc_eager.ArcEager._replace_unseen_labels
  File "spacy/strings.pyx", line 132, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '16588043228098313248'. This usually refers to an issue with the Vocab or StringStore."

I have not tried Atakan's solution yet; it seems like a lot of work. Has this been looked into since last week?

Thanks for the report, I can reproduce this error. It does look like it's a problem related to how prodigy is loading the training examples in the background, but I'm still looking into the details.


Ah, no, it turns out that it's a bug in spacy related to how the StringStore is initialized in "sourced" components. We are working on a fix!

Thank you! In the meantime, would changing to a previous version help? I did train via the rel recipe earlier in 2021 for a test model, but honestly I forgot which spaCy/Prodigy versions I was using; I can figure that out at least. We happen to be on a crunch to submit a manuscript, and we use spaCy/Prodigy in our pipeline and for the majority of the ML, which I do love :)

As long as all the pipelines you're sourcing from have the same vectors (which I think they ought to in this setup with a "base model"), then a quick fix that doesn't involve recompiling is to change this line in spacy/language.py:

source_nlps[model] = util.load_model(model)

to

source_nlps[model] = util.load_model(model, vocab=nlp.vocab)

This may cause bugs or suppress helpful warnings in other setups, though, so be wary. I'd only recommend doing this temporarily, and just for this use with prodigy.

If you don't mind installing spacy from source, a better fix is here (still subject to review, though, so it might not be the final version):

Holy moly, temp solution worked! Thank you, you guys are the best! I will train now and then install from source.