Can you recommend some good parameters to start with if I want to train my own pretrained model on different data? Is the code used to pretrain the Reddit model shared somewhere? That would help, since I could swap in a different dataset to try.
Hi! The pretraining all happens via the spacy pretrain command, which takes the raw texts as input and outputs pretrained tok2vec weights. If you're just getting started, I'd recommend running it with the default configuration and seeing how it goes. You can find more details here:
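For reference, a minimal invocation might look like the sketch below. The file names (raw_text.jsonl, config.cfg, the output directory) are placeholders, not anything from this thread, so adjust them to your setup:

```bash
# spaCy v2: raw texts (JSONL), the vectors model to predict, and an output dir
python -m spacy pretrain raw_text.jsonl en_vectors_web_lg ./pretrain_output

# spaCy v3: everything is driven by a config file; generate one with a
# [pretraining] block first, then point it at your raw texts
python -m spacy init config config.cfg --lang en --pipeline ner --pretraining
python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text raw_text.jsonl
```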
The weights we used for the NER tutorial were trained for ~8 hours on GPU, using the en_vectors_web_lg vectors as the output target (i.e. what is predicted during pretraining). In spaCy v3, the default pretraining objective is character-based: you pretrain by predicting the start/end characters of each word, which is more efficient than predicting the whole word (as is commonly done in language modelling).
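As a rough sketch based on the documented defaults (double-check against the config that spacy init config generates for your version), the character objective shows up in the v3 config like this:

```ini
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
# number of characters predicted from the start and end of each word
n_characters = 4
```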
It looks like you might be running the spaCy v2 command in a spaCy v3 environment? Double-check that your environment actually has spaCy v2 installed, or set up a new env for it so you can run the v2 commands.
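If you're not sure which version a given environment has, either of these will tell you (both are standard spaCy/Python commands, nothing specific to this thread):

```bash
python -m spacy info
# or just the version string
python -c "import spacy; print(spacy.__version__)"
```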
Hi, it's been a while, but I'm doing pretraining now with spaCy v3. In the config file, where do I put the en_vectors_web_lg vectors? Is it in paths.vectors? Do I download it and give it a path?
The default pretraining config in v3 doesn't use word vectors; it predicts the start/end characters of the words instead (rather than the vector or the whole word). But in general, if you are using word vectors in your config, you should point to a path or an installed package name (basically, anything you can load with spacy.load).
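As a sketch of what that could look like in a v3 config (the package name here is just an example; use whatever vectors package or path you actually have installed):

```ini
[paths]
vectors = "en_core_web_lg"

[initialize]
# anything loadable with spacy.load: an installed package name or a path
vectors = ${paths.vectors}

# only needed if you want to pretrain against vectors instead of characters
[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
```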