ValueError: could not broadcast input array from shape (128) into shape (96)

Hi Ines,

I'm trying to tag entities in news articles and train a model using Prodigy. [This is just a trial run of Prodigy, so I'm using a minimal number of examples.]

I used the command below:
"prodigy ner.manual ner_v12 en_core_web_sm prodigy_format_ner_input_v12_sample.jsonl --label Role,Department"

This worked for tagging.

Then I tried to train via Prodigy using the command below:
"prodigy train ner ner_v12 en_core_web_sm --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2"

I'm getting the following error:

Loaded model 'en_core_web_sm'
Created and merged data for 489 total examples
Using 392 train / 97 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
✔ Initializing with tok2vec weights ./tok2vec_cd8_model289.bin

Traceback (most recent call last):
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/prodigy/__main__.py", line 52, in <module>
controller = recipe(args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/prodigy/recipes/train.py", line 130, in train
load_pretrained_tok2vec(pipe, init_tok2vec, require=True)
File "cython_src/prodigy/util.pyx", line 520, in prodigy.util.load_pretrained_tok2vec
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 376, in from_bytes
copy_array(dest, param[b"value"])
File "/home/merit/anaconda3/envs/prodigy/lib/python3.7/site-packages/thinc/neural/util.py", line 145, in copy_array
dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (96)

Please let me know how to proceed further!

The problem here is that the base model doesn't match: if you're using tok2vec weights trained with vectors, you also need to use those vectors as the base model during training. The tok2vec_cd8_model289.bin weights were pretrained using the en_vectors_web_lg package, so this should be the base model for training (not en_core_web_sm, which has no vectors at all).

You want en_vectors_web_lg, the package that contains the large word vectors (not the core model trained with a subset of those vectors).
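
To make the error message itself a bit more concrete, here is a minimal sketch of what the last frame of the traceback (dst[:] = src) is doing. It is not Prodigy's or Thinc's actual code: the arrays and the try_copy helper are hypothetical stand-ins, and only the widths 96 and 128 are taken from the error above. The point is that an in-place copy only works when the array stored in the weights file has the same shape as the array the model allocated.

```python
import numpy as np

# Hypothetical stand-ins; only the widths (96 and 128) come from the error message.
dst = np.zeros(96, dtype="float32")   # parameter allocated by the model being trained
src = np.ones(128, dtype="float32")   # parameter stored in the pretrained weights file


def try_copy(dst, src):
    """Copy src into dst in place. Return None on success, else the error text."""
    try:
        dst[:] = src  # same statement as the last frame of the traceback
    except ValueError as err:
        return str(err)
    return None


# Shapes differ, so NumPy refuses the copy with a broadcast error.
mismatch = try_copy(dst, src)

# Once the destination has the same width as the stored weights, the copy succeeds.
match = try_copy(np.zeros(128, dtype="float32"), src)

print(mismatch)
print(match)
```

In practice, the fix is on the command line rather than in code. Assuming the en_vectors_web_lg package is installed, the training command from the question would become "prodigy train ner ner_v12 en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2".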