prodigy train using pretrained model

Hello,

I used spacy pretrain on raw text to get the weights file model999.bin. However, I got the following dimension error when running prodigy train.

[image: screenshot of the error message]

  1. Can I train on en_core_web_lg when the weights are trained on en_vectors_web_lg?

  2. Where does the shape 480 come from when there are 769 examples in total?

  3. Should I pretrain on the raw text of the 616 training examples or on all 769?

  4. Does pretraining always stop at 999 if the number of iterations is not specified?

Thanks.

No, that's likely the problem here. If you've trained the tok2vec weights using the en_vectors_web_lg package (by predicting the vector of the next word), you also need to use those vectors during training.

Those numbers are the shapes of the vectors (vectors model vs. tok2vec weights), not the number of examples.

Ideally, you'd want to pretrain on a very large sample of raw text: billions of words (Reddit, CommonCrawl, Wikipedia etc.). Pretraining on just a few hundred of your own examples likely won't be very effective.

The default number of iterations (configurable with --n-iter) is 1000. If you're training on a large sample of raw text, you typically wouldn't train for that many iterations – if your corpus is large enough, you can even just stream it in and train on it until the model stops improving. There's not necessarily an advantage in looping over the same data multiple times.
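As a rough illustration of the two points above, here is a minimal Python sketch. It assumes that checkpoints are numbered from zero with one file per iteration (which is what the model999.bin filename suggests), and it uses a simple patience rule as one possible reading of "until the model stops improving". The function names are hypothetical and are not part of spaCy or Prodigy.

```python
from typing import Sequence


def last_checkpoint_name(n_iter: int = 1000) -> str:
    """Name of the final weights file, assuming zero-based numbering
    with one checkpoint per iteration (an assumption, not a documented rule)."""
    return f"model{n_iter - 1}.bin"


def should_stop(losses: Sequence[float], patience: int = 3) -> bool:
    """Return True if none of the last `patience` losses beats the best
    loss seen before them. This is a hypothetical stopping rule."""
    if len(losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) >= best_before


# With the default of 1000 iterations, the last file is number 999.
final_name = last_checkpoint_name()

# The loss has plateaued here, so the rule says to stop.
plateaued = should_stop([5.0, 4.0, 3.0, 3.1, 3.2, 3.0])
```

With a streamed corpus, a check like `should_stop` would run on the losses of successive batches rather than on repeated passes over the same data.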

Thanks Ines. Here is the error message I got after changing the base model to en_vectors_web_lg.

Another thing that confuses me: why is token_vector_width 96 and pretrained_dims 300? I did not set either of them.

I think the problem here is that you didn't pretrain with the setting to use the word vectors as features. If you didn't add the --use-vectors flag, the model won't expect to have the word vectors during training. So for the model you've pretrained, try setting blank:en as the model rather than en_vectors_web_lg.
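The rule described in this thread (use the vectors package as the base model only if pretraining used --use-vectors, otherwise use blank:en) can be sketched in Python as below. The helper names, the file paths, the dataset name my_dataset and the ner component are all hypothetical placeholders. The command layouts are my understanding of the spaCy v2 and Prodigy v1.9+ CLIs, so please check them against the docs for your versions.

```python
from typing import List


def pick_base_model(vectors_model: str, used_vectors_flag: bool, lang: str = "en") -> str:
    """Base model to pass to prodigy train, following the advice in this thread."""
    if used_vectors_flag:
        # Weights were pretrained with the static vectors as features,
        # so the same vectors need to be available during training.
        return vectors_model
    # Weights were pretrained without vector features: start from a blank model.
    return f"blank:{lang}"


def pretrain_command(raw_text: str, vectors_model: str, output_dir: str,
                     use_vectors: bool) -> List[str]:
    """Argument list for spacy pretrain (layout assumed from spaCy v2)."""
    cmd = ["python", "-m", "spacy", "pretrain", raw_text, vectors_model, output_dir]
    if use_vectors:
        cmd.append("--use-vectors")
    return cmd


def train_command(component: str, dataset: str, base_model: str, weights: str) -> List[str]:
    """Argument list for prodigy train (layout assumed from Prodigy v1.9+)."""
    return ["prodigy", "train", component, dataset, base_model, "--init-tok2vec", weights]


# Pretrained without --use-vectors, as in this thread: train on blank:en.
base = pick_base_model("en_vectors_web_lg", used_vectors_flag=False)
cmd = train_command("ner", "my_dataset", base, "./pretrained/model999.bin")
print(" ".join(cmd))
```

Keeping the flag and the base model in one place like this makes it harder for the pretraining and training settings to drift apart.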

These are different dimensions in the model. Unfortunately there's no way to talk about this especially clearly, because the neural network model has many layers of activations on the way to finally assigning a vector to each token. There isn't really an elegant way to distinguish those activations from each other terminologically. So, the word vectors you load in, the big static table, are just one feature used to compute the activation for each word individually. The other features are also vectors, representing the word's lower-cased form, prefix, suffix and shape. All of those vectors are mixed together in a feed-forward layer to produce another vector, and then a CNN is run to mix in information from the surrounding context, outputting the thing we call "token vectors".

So:

  • The pretrained_dims is the width of the big static table, produced by an algorithm like word2vec or GloVe. It got that name before the spacy pretrain command existed, so it refers to the word vectors, not to the pretrained tok2vec weights.
  • The token_vector_width is the width of the vectors output by the CNN.
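To make the two widths more concrete, here is a shape-only Python sketch of the pipeline described above. It is a simplification and rests on assumptions that are not stated in this thread: that each embedded feature (lower-cased form, prefix, suffix, shape) has width token_vector_width, and that the static vector is first projected from pretrained_dims down to token_vector_width before the features are concatenated. The function name is hypothetical.

```python
from typing import Dict


def tok2vec_widths(token_vector_width: int = 96,
                   pretrained_dims: int = 300,
                   use_vectors: bool = True) -> Dict[str, int]:
    """Track vector widths through the layers (no real computation)."""
    # One embedding per feature, each assumed to be token_vector_width wide.
    feature_widths = {name: token_vector_width
                      for name in ("lower", "prefix", "suffix", "shape")}
    if use_vectors:
        # Static table row (pretrained_dims wide), assumed to be projected
        # down to token_vector_width before mixing.
        feature_widths["static_vector"] = token_vector_width
    return {
        "static_table": pretrained_dims if use_vectors else 0,
        "concatenated": sum(feature_widths.values()),  # input to feed-forward layer
        "mixed": token_vector_width,                   # output of feed-forward layer
        "token_vectors": token_vector_width,           # output of the CNN
    }


with_vectors = tok2vec_widths(use_vectors=True)
without_vectors = tok2vec_widths(use_vectors=False)
```

Under these assumptions, the concatenated width is 5 × 96 = 480 with word vectors and 4 × 96 = 384 without. That would be consistent with the 480 in the original question, and with a mismatch between weights pretrained without vector features and a model that expects them.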

Again, I regret how confusing this all is, but it's a dilemma that other deep learning models face as well.