spaCy pretrain best practices

I have pretrained weights, produced by running spacy pretrain with spaCy 2.1.4, that I would like to use in an experiment. I passed the model path to --init-tok2vec in textcat.batch-train in Prodigy 1.8, but I am seeing the following error:

Traceback (most recent call last):                                                                                                                             
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/prodigy/recipes/textcat.py", line 254, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 232, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 249, in prodigy.models.textcat.TextClassifier._update
  File "pipes.pyx", line 933, in spacy.pipeline.pipes.TextCategorizer.update
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/api.py", line 269, in finish_update
    d_X = bp_layer(layer.ops.flatten(d_seqs_out, pad=pad), sgd=sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/api.py", line 354, in uniqued_bwd
    d_uniques = bp_Y_uniq(dY_uniq, sgd=sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 53, in continue_update
    gradient = callback(gradient, sgd)
  File "ops.pyx", line 100, in thinc.neural.ops.Ops.dropout.wrap_backprop.finish_update
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 68, in finish_update
    return backprop_child(d_xhat, sgd)
  File "/Users/james/Personal/Prodigy/prodigy_venv/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 87, in finish_update
    self.d_W += d_W.reshape((self.nO, self.nP, self.nI))
ValueError: cannot reshape array of size 110592 into shape (96,3,480)

Are there other considerations I need to make when pretraining so the weights will work with prodigy?

With regard to textcat.batch-train: when using -t2v, do we need to specify a particular model? Are a blank model, en_core_web_sm/md/lg, and en_vectors_web_lg all fine?

When using a blank model or en_vectors_web_lg, I get the following error:

File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

Thank you, kind regards,

Claudio Nespoli

Can you run spacy validate and just check you have the right model version?

For pretraining to work, you need:

  1. If you want to train with vectors afterwards, you need to use the --use-vectors argument
  2. The same hyper-parameters between pretraining and training. Prodigy should take care of this.
  3. When you do batch training, you have to start off with a blank model, or a model with vectors. You can’t start off with a model like en_core_web_sm.
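
For point 2, the hyper-parameters that spacy pretrain used are written to a config.json in its output directory, so you can eyeball them against your training run. A minimal stdlib sketch, with the file contents inlined as a stand-in (only the use_vectors key appears in this thread; the width and depth values here are hypothetical examples):

```python
import json

# Stand-in for: cfg = json.load(open("./pretrained-model/config.json"))
# (the width/depth values below are made-up examples)
cfg = json.loads('{"use_vectors": false, "width": 96, "depth": 4}')

# The pretrained tok2vec weights only load cleanly if training builds the
# same architecture, so compare these against the training settings.
for key in ("use_vectors", "width", "depth"):
    print(key, "=", cfg[key])
```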

Could you give the commands you’re running, to make it easier to sort this out?

From spacy validate I get the following results:

TYPE      NAME                MODEL               VERSION
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-md      en_core_web_md      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔

I checked the config.json in the pretraining output folder, and I see:

"use_vectors": false

The list of commands I use:

python -m spacy pretrain ./raw_data.jsonl en_vectors_web_lg ./pretrained-model

python -m prodigy textcat.batch-train dataset_train en_vectors_web_lg --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

When I run:

python -m prodigy textcat.batch-train dataset_train en_vectors_web_lg --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

python -m prodigy textcat.batch-train dataset_train --output "./model" -e dataset_eval -t2v "./pretrained-model/model197.bin"

I get this error, even without the -t2v argument:

File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

@honnibal Here’s what I did:

For pretraining -

python -m spacy pretrain data/question_responses_individual_train.jsonl en_vectors_web_lg models/ --use-vectors

For training -

prodigy textcat.batch-train all_symptoms_question_response_individual_train en_vectors_web_lg --init-tok2vec ../../code/symptom-experiments/models/model813.bin 

I am using spacy 2.1.4 and prodigy 1.8.0.

The output of python -m spacy validate:

TYPE      NAME                MODEL               VERSION                            
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔

Using --use-vectors during pretraining, I get the same error during textcat.batch-train, regardless of which model I select:

```
ValueError: cannot reshape array of size 110592 into shape (96,3,480)
```

I get the same error when using a blank model and None for all other options, with the textcat.batch-train recipe.
The error:

...
    best_model = nlp.to_bytes()
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 817, in to_bytes
    return util.to_bytes(serializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 601, in to_bytes
    serialized[key] = getter()
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 815, in <lambda>
    serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
  File "pipes.pyx", line 1163, in spacy.pipeline.pipes.Sentencizer.to_bytes
  File "/usr/local/lib/python3.6/site-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/usr/local/lib/python3.6/site-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.vocab.Vocab' object

Output of spacy validate:


====================== Installed models (spaCy v2.1.4) ======================
spaCy installation: /usr/local/lib/python3.6/site-packages/spacy

TYPE      NAME                MODEL               VERSION                            
package   en-vectors-web-lg   en_vectors_web_lg   2.1.0   ✔
package   en-core-web-sm      en_core_web_sm      2.1.0   ✔
package   en-core-web-md      en_core_web_md      2.1.0   ✔
package   en-core-web-lg      en_core_web_lg      2.1.0   ✔
link      en                  en_core_web_sm      2.1.0   ✔

Thanks,
Ati


Thanks for the reports. It looks like there must be a bug here – will try to get this fixed and push a v1.8.2 as soon as I can.


I just reproduced this, and here’s a quick summary to make it easier for us to fix. Both setups used the en_vectors_web_lg model.

Case 1: Pre-train without --use-vectors and batch-train with the artifact. Error raised when calling nlp.to_bytes to serialize the best model:

TypeError: can not serialize 'spacy.vocab.Vocab' object

Case 2: Pre-train with --use-vectors and batch-train with the artifact. Error raised when calling TextCategorizer.update:

ValueError: cannot reshape array of size 110592 into shape (96,3,480)
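
For what it’s worth, the numbers in that error are consistent with an input-width mismatch between the pretrained weights and the layer built at training time. A quick arithmetic check (the inferred width of 384 is my own reading, not something taken from any config):

```python
# The maxout layer expects weights of shape (nO, nP, nI) = (96, 3, 480),
# but the array actually loaded has 110592 elements.
expected_size = 96 * 3 * 480        # 138240
loaded_size = 110592

# The loaded array factors cleanly with a smaller input width:
inferred_nI = loaded_size // (96 * 3)
print(expected_size, loaded_size, inferred_nI)  # 138240 110592 384
```

The difference, 480 − 384 = 96, is exactly one extra width-96 feature column, which is what the static-vectors input would contribute, i.e. consistent with one side of the pipeline using vectors and the other not (my reading).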

We just released v1.8.2, which fixes the sentencizer serialization issue and ensures that the hyper-parameters from pretraining are passed through to textcat.batch-train, which should resolve the shape incompatibility.

Btw, one quick note about the --use-vectors flag: If you use --use-vectors in the spacy pretrain command, it’s important to have a model with vectors as an argument to textcat.batch-train. (So basically, there needs to be an input model that has the word vectors loaded – those vectors do not come from the pre-trained tok2vec artifact.)
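
Concretely, reusing the commands already posted in this thread, the two consistent pairings would look like this (assuming Prodigy v1.8.2+, so the pretraining hyper-parameters are carried over automatically; the second command in Pairing A starts from a blank model):

```shell
# Pairing A: pretrain WITHOUT vectors, then batch-train from a blank model.
python -m spacy pretrain ./raw_data.jsonl en_vectors_web_lg ./pretrained-model
python -m prodigy textcat.batch-train dataset_train --output ./model -t2v ./pretrained-model/model197.bin

# Pairing B: pretrain WITH --use-vectors, then batch-train from the same vectors model.
python -m spacy pretrain ./raw_data.jsonl en_vectors_web_lg ./pretrained-model --use-vectors
python -m prodigy textcat.batch-train dataset_train en_vectors_web_lg --output ./model -t2v ./pretrained-model/model197.bin
```

In both cases, the vectors available at training time come from the input model, not from the pretrained tok2vec artifact.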