spaCy pretrain best practices

I have pretrained weights from running spacy pretrain with spaCy 2.1.4 that I would like to use in an experiment. I passed the model path to the --init-tok2vec argument of textcat.batch-train in Prodigy 1.8, but I am seeing the following error:

Traceback (most recent call last):                                                                                                                             
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/prodigy/recipes/textcat.py", line 254, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 232, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 249, in prodigy.models.textcat.TextClassifier._update
  File "pipes.pyx", line 933, in spacy.pipeline.pipes.TextCategorizer.update
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/api.py", line 269, in finish_update
    d_X = bp_layer(layer.ops.flatten(d_seqs_out, pad=pad), sgd=sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/api.py", line 354, in uniqued_bwd
    d_uniques = bp_Y_uniq(dY_uniq, sgd=sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "ops.pyx", line 100, in thinc.neural.ops.Ops.dropout.wrap_backprop.finish_update
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 68, in finish_update
    return backprop_child(d_xhat, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 87, in finish_update
    self.d_W += d_W.reshape((self.nO, self.nP, self.nI))
ValueError: cannot reshape array of size 110592 into shape (96,3,480)
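(For context: this reshape fails because the flattened weight array saved during pretraining has a different number of elements than the shape the training run expects, which happens when the two runs use different hyper-parameters. A minimal sketch of the size check in plain Python, with the numbers taken from the error message:)

```python
# The maxout layer stores its weights flat and reshapes them to
# (nO, nP, nI). The reshape only succeeds if the element counts match.

def can_reshape(n_elements, shape):
    """Return True if a flat array of n_elements fits the target shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total == n_elements

# The training config expects (96, 3, 480), but the saved array has
# 110592 elements -- consistent with a smaller input width (nI=384),
# i.e. the pretrained layer was built with different dimensions.
print(can_reshape(110592, (96, 3, 480)))  # False: 96 * 3 * 480 = 138240
print(can_reshape(110592, (96, 3, 384)))  # True:  96 * 3 * 384 = 110592
```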

Are there other considerations I need to make when pretraining so the weights will work with prodigy?

With regard to textcat.batch-train, when using -t2v, do we need to specify a particular model? Are a blank model, en_core_web_sm/md/lg and en_vectors_web_lg all fine?

When using a blank model or en_vectors_web_lg I get the following error:

File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

thank you
kind regards

claudio nespoli

Can you run spacy validate and just check you have the right model version?

For pretraining to work, you need:

  1. If you want to train with vectors afterwards, you need to use the --use-vectors argument
  2. The same hyper-parameters between pretraining and training. Prodigy should take care of this.
  3. When you do batch training, you have to start off with a blank model, or a model with vectors. You can’t start off with a model like en_core_web_sm.

Could you give the commands you’re running, to make it easier to sort this out?

From spacy validate I get the following results:

TYPE      NAME                MODEL               VERSION
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-md      en_core_web_md      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔

I checked the config.json in the pretraining folder and I see:

"use_vectors":false
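(Checking that file is a quick way to confirm the pretraining settings. A small self-contained sketch of reading it with the standard library; the keys and values below are stand-ins written to a temp directory, since the real path and contents depend on your run:)

```python
import json
import os
import tempfile

# Stand-in config: `spacy pretrain` saves its settings next to the weights.
# The keys/values below are illustrative, not copied from a real run.
sample_config = {"use_vectors": False, "width": 96, "depth": 4}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump(sample_config, f)

    # The actual check: load the config and inspect the settings that
    # must match between pretraining and training.
    with open(path) as f:
        config = json.load(f)

print("pretrained with vectors:", config["use_vectors"])
```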

The list of commands I use:

python -m spacy pretrain ./raw_data.jsonl en_vectors_web_lg ./pretrained-model

python -m prodigy textcat.batch-train dataset_train en_vectors_web_lg --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

when I run

python -m prodigy textcat.batch-train dataset_train en_vectors_web_lg --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

python -m prodigy textcat.batch-train dataset_train --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

I get this error, even without the -t2v argument:

File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

@honnibal Here’s what I did:

For pretraining -

python -m spacy pretrain data/question_responses_individual_train.jsonl en_vectors_web_lg models/ --use-vectors

For training -

prodigy textcat.batch-train all_symptoms_question_response_individual_train en_vectors_web_lg --init-tok2vec ../../code/symptom-experiments/models/model813.bin 

I am using spacy 2.1.4 and prodigy 1.8.0.

The output of python -m spacy validate:

TYPE      NAME                MODEL               VERSION                            
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔

Using --use-vectors during pretraining, I get the same error during textcat.batch-train, regardless of which model I select:

```
ValueError: cannot reshape array of size 110592 into shape (96,3,480)
```

I get the same error when using a blank model and None for all other options, with the textcat.batch-train recipe. The error:

...
    best_model = nlp.to_bytes()
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 817, in to_bytes
    return util.to_bytes(serializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 601, in to_bytes
    serialized[key] = getter()
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 815, in <lambda>
    serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
  File "pipes.pyx", line 1163, in spacy.pipeline.pipes.Sentencizer.to_bytes
  File "/usr/local/lib/python3.6/site-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/usr/local/lib/python3.6/site-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

Output of spacy validate:


====================== Installed models (spaCy v2.1.4) ======================
spaCy installation: /usr/local/lib/python3.6/site-packages/spacy

TYPE      NAME                MODEL               VERSION                            
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-md      en_core_web_md      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔
link      en                  en_core_web_sm      2.1.0   ✔

Thanks,
Ati


Thanks for the reports. It looks like there must be a bug here – will try to get this fixed and push a v1.8.2 as soon as I can.


I just reproduced this and here’s the quick summary to make it easier for us to fix it. Both setups used the en_vectors_web_lg models.

Case 1: Pre-train without --use-vectors and batch train with the artifact. Error raised when calling nlp.to_bytes to serialize best model:

TypeError: can not serialize 'spacy.vocab.Vocab' object

Case 2: Pre-train with --use-vectors and batch train with the artifact. Error raised when calling TextCategorizer.update:

ValueError: cannot reshape array of size 110592 into shape (96,3,480)
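(Background on the Case 1 failure: msgpack, which srsly uses for serialization, can only pack plain data types, so handing it an arbitrary object such as a Vocab raises exactly this kind of TypeError. The same failure mode can be reproduced with the standard library's json module and a stand-in class:)

```python
import json

class Vocab:
    """Stand-in for spacy.vocab.Vocab: an arbitrary object the
    serializer has no encoding for."""
    pass

error = None
try:
    # json, like msgpack, only handles plain data types (dicts, lists,
    # strings, numbers, ...), so this raises a TypeError.
    json.dumps({"vocab": Vocab()})
except TypeError as e:
    error = e

print("serialization failed:", error)
```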

We just released v1.8.2, which fixes the serialization issue of the sentencizer and ensures that hyperparameters from pretraining are passed to textcat.batch-train, which should resolve the shape incompatibility issue.

Btw, one quick note about the --use-vectors flag: If you use --use-vectors in the spacy pretrain command, it’s important to have a model with vectors as an argument to textcat.batch-train. (So basically, there needs to be an input model that has the word vectors loaded – those vectors do not come from the pre-trained tok2vec artifact.)
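(The rule above can be written down as a small sanity check. This is just an illustrative encoding of the advice in this thread, not a Prodigy API; the function name and parameters are made up:)

```python
def base_model_ok(pretrain_vectors, base_model_vectors):
    """Rule of thumb from this thread: if `spacy pretrain` ran with
    --use-vectors, the base model passed to batch-train must provide
    those same word vectors; without --use-vectors, a blank model is
    fine too. Arguments are vector-package names, or None for no
    vectors."""
    if pretrain_vectors is None:  # pretrained without --use-vectors
        return True
    return base_model_vectors == pretrain_vectors

# Pretrained on en_vectors_web_lg with --use-vectors:
print(base_model_ok("en_vectors_web_lg", "en_vectors_web_lg"))  # True
print(base_model_ok("en_vectors_web_lg", None))                 # False: blank model
```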

@ines, @honnibal Hi, I tried to use spacy pretrain as suggested in this tutorial: https://explosion.ai/blog/sense2vec-reloaded#ner-results. Do you suggest using en_vectors_web_lg as the vector model, or my own vector model obtained from fastText on my raw text? I'm using the same raw text file as input to spacy pretrain.

Depends on what you're doing – but if your vectors are custom and more domain-specific, you might see better results if you're using your own. Just make sure they're large enough so you have enough coverage. You could also try spacy pretrain with both and then compare the results.

Using the same raw text for both the vectors and pretraining should be no problem.

@ines I used spacy pretrain with --use-vectors and now I can only use en_core_web_lg as the model for ner.batch-train. Neither blank:en nor my own model works anymore. I get the following error:

Loaded model en
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/opt/anaconda3/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 571, in batch_train
    hyper_params = read_pretrain_hyper_params(init_tok2vec, require=False)
  File "cython_src/prodigy/util.pyx", line 524, in prodigy.util.read_pretrain_hyper_params
  File "/opt/anaconda3/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 57, in read_msgpack
    msg = msgpack.load(f, raw=False, use_list=use_list)
  File "/opt/anaconda3/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 50, in unpack
    return _unpack(stream, **kwargs)
  File "_unpacker.pyx", line 213, in srsly.msgpack._unpacker.unpack
  File "_unpacker.pyx", line 203, in srsly.msgpack._unpacker.unpackb
ValueError: Unpack failed: incomplete input

Could you share the full command? That msgpack error is definitely strange.

If you're initializing your model with pretrained weights and --init-tok2vec, you do have to use the same word vectors that you used during pretraining, so if you pretrain with en_core_web_lg, that also needs to be the base model later on.

Ah, ok. Does that mean that since I used en_vectors_web_lg for pretraining, I can only use en_core_web_lg for training?
But I got exactly the same result when training with ner.batch-train with or without the --init-tok2vec parameter. Does that mean the model is not using the pretrained weights at all?

prodigy ner.batch-train my_data_set en_core_web_lg --label "my_label" --n-iter 10 --init-tok2vec pre-train/model320.bin --output models/ --batch-size 4 --eval-split 0.20 --dropout 0.2

If you used the en_vectors_web_lg vectors during pretraining, then those vectors also need to be available during training, yes. So that's the base model you should use. (The en_core_web_lg model is smaller and only has a subset of the vectors. So you might end up with less precise vectors and potentially worse results.) What happens if you use the en_vectors_web_lg model as the base model?

I get the same error when using blank:en, en_vectors_web_lg or my own model from fastText.

ValueError: Unpack failed: incomplete input

It looks like I can only use en_core_web_lg, and I get the same result whether or not I use the --init-tok2vec parameter for training.

Also, I just tried prodigy textcat.batch-train with en_core_web_lg and got the same error. So I cannot use any model for textcat.