Good configs for spaCy pretraining

I'm interested in training a custom tok2vec model like the Reddit one mentioned here: Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube

Can you recommend some good parameters to start with if I want to pretrain my own model on different data? Is the code used to pretrain the Reddit model shared somewhere? That would help, since I could swap in a different dataset and try it.

Thank you.

Hi! The pretraining all happens via the spacy pretrain command, which takes the raw texts as input and outputs pretrained tok2vec weights. If you're just getting started, I'd recommend running it with the default configuration and seeing how it goes. You can find more details here:
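As a rough sketch of what the raw text input can look like: spacy pretrain reads a JSONL file with one JSON object per line, each holding a "text" key. The helper names, the sample texts and the data.jsonl file name below are made up for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, List


def to_jsonl_lines(texts: Iterable[str]) -> List[str]:
    """Turn raw texts into JSONL lines of the form {"text": "..."}."""
    lines = []
    for text in texts:
        text = text.strip()
        if text:  # skip empty documents
            lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return lines


def write_jsonl(texts: Iterable[str], path: str) -> int:
    """Write the texts to `path`, one JSON object per line; return the count."""
    lines = to_jsonl_lines(texts)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf8")
    return len(lines)


# Hypothetical sample texts; replace them with your own domain data.
sample_texts = ["First raw document.", "  ", "Second raw document."]
for line in to_jsonl_lines(sample_texts):
    print(line)
# To create the input file: write_jsonl(sample_texts, "data.jsonl")
```

Keeping one document per line means the file can be streamed, which matters once the raw text collection gets large.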

The weights we used for the NER tutorial were trained for ~8 hours on GPU, using the en_vectors_web_lg vectors as the output target (i.e. what is predicted during pretraining). In spaCy v3, the default pretraining objective is character-based, so you're pretraining by predicting the start/end characters of each word, which is more efficient than predicting the whole word (as is commonly done in language modelling).
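To make the character-based objective a bit more concrete, here is a simplified illustration of the targets: the first and last few characters of each word. This is not spaCy's implementation (it ignores encoding and padding details), and the function name and the value of 4 characters are assumptions made for the example.

```python
from typing import List, Tuple


def character_targets(words: List[str], n_characters: int = 4) -> List[Tuple[str, str]]:
    """For each word, return the (start, end) characters a model would predict."""
    return [(word[:n_characters], word[-n_characters:]) for word in words]


words = "Pretraining predicts characters".split()
for word, (start, end) in zip(words, character_targets(words)):
    print(f"{word!r}: start={start!r}, end={end!r}")
```

The output space here is a handful of characters per word rather than a full vocabulary, which is where the efficiency gain comes from.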

I tried to use the command: python -m spacy pretrain ./data.jsonl en_vectors_web_lg ./pretrained-model,

but I get the error: ✘ Invalid config override './pretrained-model': name should start with

It looks like you might be using the command for spaCy v2 in spaCy v3? So double-check that your environment actually has spaCy v2 installed or set up a new env for it so you can run the v2 commands.
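One quick way to check which major version is installed, and what shape the pretrain command takes for it, could look like the sketch below. The function names and the config.cfg file name are hypothetical. The v2 form takes three positional arguments, while v3 takes a config path and an output directory, and anything after that is read as a config override that has to start with "--".

```python
from importlib.metadata import PackageNotFoundError, version
from typing import List, Optional


def installed_major(package: str = "spacy") -> Optional[int]:
    """Return the major version of an installed package, or None if it is missing."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


def pretrain_command(major: int) -> List[str]:
    """Build the argument list for `spacy pretrain` for the given major version."""
    base = ["python", "-m", "spacy", "pretrain"]
    if major == 2:
        # v2: raw texts, vectors model, output directory (all positional)
        return base + ["./data.jsonl", "en_vectors_web_lg", "./pretrained-model"]
    # v3 and later: config file, output directory, then "--" style overrides
    return base + ["./config.cfg", "./pretrained-model", "--paths.raw_text", "./data.jsonl"]


major = installed_major()
if major is None:
    print("spaCy is not installed in this environment")
else:
    print(f"spaCy v{major}:", " ".join(pretrain_command(major)))
```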

I'm using spaCy v2 because the tutorials for entity linking are also in spaCy v2. What is the command in spaCy v2?

Ah okay, you should definitely look at the spaCy v2 docs then. See here:

Hi, it's been a while, but I'm doing pretraining now with spaCy v3. In the config file, where do I put the en_vectors_web_lg vectors? Is it in paths.vectors? Do I download it and give it a path?

You can use spaCy's init config command with --pretraining to auto-generate a config for pretraining. You can then edit the pretraining block if you want to configure the settings. See here for details on the settings and what they mean:
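A minimal sketch of that init config call, written as an argument list so it can be printed or passed to subprocess. The helper name, the config.cfg output file and the choice of "en" with an "ner" pipeline are assumptions for the example; adjust them to your project.

```python
import shlex
from typing import List


def init_config_command(output: str = "config.cfg", lang: str = "en",
                        pipeline: str = "ner") -> List[str]:
    """Build the `spacy init config` call that also adds a [pretraining] block."""
    return [
        "python", "-m", "spacy", "init", "config", output,
        "--lang", lang,
        "--pipeline", pipeline,
        "--pretraining",  # include the pretraining settings in the config
    ]


cmd = init_config_command()
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```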

The default pretraining config in v3 doesn't use word vectors and instead predicts the start/end characters of the words (instead of the vector or the whole word). But in general, if you are using word vectors in your config, you should point to a path or an installed package name (basically, anything you can load with spacy.load).
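For the case where you do want vectors, the relevant config sections might look roughly like the fragment below, shown here as a string and read with Python's configparser only to print the raw values (spaCy resolves the ${paths.vectors} reference with its own config system). The en_core_web_lg package name is just an example of an installed package that ships vectors, and the vector-based objective block and its values are a sketch to check against the spaCy docs for your installed version.

```python
from configparser import ConfigParser

# Illustrative fragment only; merge the relevant lines into your generated config.
CONFIG_FRAGMENT = """
[paths]
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"
"""

# interpolation=None keeps "${paths.vectors}" as a raw string instead of resolving it
parser = ConfigParser(interpolation=None)
parser.read_string(CONFIG_FRAGMENT)
for section in parser.sections():
    print(section, dict(parser[section]))
```

Setting paths.vectors in one place and referencing it from the initialize block means the same value can also be overridden on the command line with --paths.vectors.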