Can you recommend some good parameters to start with if I want to train my own pretrained model on different data? Is the code used to pretrain the Reddit model shared somewhere? That would help, since I could swap in a different dataset to try.
Hi! The pretraining all happens via the spacy pretrain command, which takes raw texts as input and outputs pretrained tok2vec weights. If you're just getting started, I'd recommend running it with the default configuration and seeing how you go. You can find more details here:
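For reference, a minimal spaCy v2-style invocation looks roughly like this; the file names and output directory are placeholders, not the exact command used for the Reddit model:

# texts.jsonl has one JSON object per line, e.g. {"text": "a raw paragraph of text ..."}
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained-weights

The second argument is the vectors model whose vectors are used as the prediction target; the output directory will contain the pretrained tok2vec weights you can later pass to training.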
The weights we used for the NER tutorial were trained for ~8 hours on GPU, using the en_vectors_web_lg vectors as the output target (i.e. what is predicted during pretraining). In spaCy v3, the default pretraining objective is a character-based objective, so you're pretraining by predicting the start/end characters of each word, which is more efficient than predicting the whole word (as is commonly done in language modelling).
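In a v3 config, that character objective lives under [pretraining.objective]; the defaults look roughly like this (exact values may differ depending on the config you generate):

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4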
It looks like you might be using the spaCy v2 command with spaCy v3? Double-check that your environment actually has spaCy v2 installed, or set up a new env for it so you can run the v2 commands.
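To check which version you actually have, and to set up a separate v2 environment if that's what you need, something along these lines should work (the env name is just an example):

python -m spacy info          # prints the installed spaCy version
python -m venv .venv-spacy2   # fresh environment just for v2
source .venv-spacy2/bin/activate
pip install "spacy>=2.3,<3"

In v3 the command is config-driven instead, e.g. python -m spacy pretrain config.cfg ./output --paths.raw_text texts.jsonl, assuming your config exposes the raw text path as paths.raw_text.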
Hi, it's been a while, but I'm doing pretraining now with spaCy v3. In the config file, where do I put the en_vectors_web_lg vectors? Is it in paths.vectors? Do I download it and give it a path?
The default pretraining config in v3 doesn't use word vectors and instead predicts the start/end characters of the words (rather than the vector or the whole word). But in general, if you are using word vectors in your config, you should point to a path or an installed package name (basically, anything you can load with spacy.load).
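As a sketch, assuming you want vectors from an installed package such as en_core_web_lg (en_vectors_web_lg is a v2-era package), the relevant config sections would look roughly like this:

[paths]
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}

# only needed if you want to predict vectors instead of characters during pretraining
[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"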
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
ℹ Loading config from: my-gpu.cfg
✔ Created output directory: output
✔ Saved config file in the output directory
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
...
File "/home/ubuntu/.local/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
return self._func(self, X, is_train=False)[0]
File "/home/ubuntu/.local/lib/python3.8/site-packages/thinc/layers/list2array.py", line 22, in forward
lengths = model.ops.asarray1i([len(x) for x in Xs])
TypeError: 'FullTransformerBatch' object is not iterable
BTW, I have 8 GPUs, but it still uses the CPU. How can I change that as well?
Thanks for your questions. Since your problems are more about spaCy than Prodigy, can you post your issue on the spaCy GitHub discussions forum?
The spaCy core team monitors that forum, not this one, and they have much more expertise in handling GPUs and pretraining.
The problem here is that since you're using en_core_web_trf, the pipeline doesn't use spaCy's static vectors; the transformer serves that purpose instead. That's why, if you're using en_core_web_trf, you wouldn't reference en_vectors_web_lg.
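For context, a transformer pipeline's config references a transformer component rather than static vectors. A simplified sketch, assuming bert-base-uncased and typical values (architecture versions depend on your spacy-transformers release), looks roughly like this:

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

# downstream components listen to the transformer instead of looking up static vectors
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}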