good configs for spacy pretraining

I'm interested in training a custom tok2vec model like the reddit one mentioned here: Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube

Can you recommend some good parameters to start with if I want to train my own pretrained model on different data? Is the code used to pretrain the Reddit model shared somewhere? That would help, since I could swap in a different dataset to try.

Thank you.


Hi! The pretraining all happens via the spacy pretrain command, which takes the raw texts as input and outputs pretrained tok2vec weights. If you're just getting started, I'd recommend running it with the default configuration and seeing how you go. You can find more details here:

The weights we used for the NER tutorial were trained for ~8 hours on GPU, using the en_vectors_web_lg vectors as the output target (i.e. what is predicted during pretraining). In spaCy v3, the default pretraining objective is a character-based objective, so you're pretraining by predicting the start/end characters of the word, which is more efficient than predicting the whole word (as is commonly done in language modelling).
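
The corresponding block in an auto-generated v3 config looks roughly like this (a sketch; the exact values are whatever init config --pretraining produces for you):

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4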


I tried to use the command: python -m spacy pretrain ./data.jsonl en_vectors_web_lg ./pretrained-model,

but I get the error: ✘ Invalid config override './pretrained-model': name should start with

It looks like you might be using the command for spaCy v2 in spaCy v3? So double-check whether your environment actually has spaCy v2 installed, or set up a new env for it so you can run the v2 commands.
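
A quick way to double-check which version an environment actually has is spaCy's info command, which prints the installed spaCy version and pipelines:

python -m spacy info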

I'm using spaCy v2, because the entity linking tutorials are also in spaCy v2. What is the command in spaCy v2?

Ah okay, you should definitely look at the spaCy v2 docs then. See here:
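
For reference, the positional form you already tried is the v2 one, so with spaCy 2.x installed it should run roughly as-is (a sketch, assuming en_vectors_web_lg is installed in that env):

python -m spacy pretrain ./data.jsonl en_vectors_web_lg ./pretrained-model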

Hi, it's been a while, but I'm doing pretraining now with spaCy v3. In the config file, where do I put the en_vectors_web_lg vectors? Is it paths.vectors? Do I download them and provide a path?

You can use spaCy's init config command with --pretraining to auto-generate a config for pretraining: https://spacy.io/api/cli#init-config You can then edit the pretraining block if you want to configure the settings. See here for details on the settings and what they mean: https://spacy.io/api/data-formats#config-pretraining
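
For example (a sketch; the --lang and --pipeline values are just placeholders for your own setup):

python -m spacy init config config.cfg --lang en --pipeline ner --pretraining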

The default pretraining config in v3 doesn't use word vectors and instead predicts the start/end characters of the words (rather than the vector or the whole word). But in general, if you are using word vectors in your config, you should point to a path or an installed package name (basically, anything you can load with spacy.load).
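
So if you do want static vectors available, the relevant bits of the config would look something like this (a sketch; en_core_web_lg is just an example of an installed package that ships with vectors):

[paths]
vectors = "en_core_web_lg"

[initialize]
vectors = ${paths.vectors}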


I tried this

python -m spacy init config --gpu my-gpu.cfg --pretraining

then changed roberta-base to bert-base-uncased like this (because I use BERT in other code):

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"

Then I changed the pretraining block to look like this:

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
n_save_epoch = null
component = "transformer"
layer = ""
corpus = "corpora.pretrain"

So I changed the component to "transformer", as suggested here: Can I use pretraining with GPU in V3? · Issue #6973 · explosion/spaCy · GitHub

and tried running

python -m spacy pretrain my-gpu.cfg ./output --paths.raw_text all.jsonl 

Which failed with

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
ℹ Loading config from: my-gpu.cfg
✔ Created output directory: output
✔ Saved config file in the output directory
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
...
  File "/home/ubuntu/.local/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/thinc/layers/list2array.py", line 22, in forward
    lengths = model.ops.asarray1i([len(x) for x in Xs])
TypeError: 'FullTransformerBatch' object is not iterable

BTW, I have 8 GPUs but it still uses the CPU; how can I change that as well?

I'm a newbie. I've tried changing the pretraining block to this:

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
n_save_epoch = null
component = "en_vectors_web_lg"
layer = ""
corpus = "corpora.pretrain"

but then I got

No component 'en_vectors_web_lg' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'ner']"

I copied stuff from here: config.cfg · spacy/pl_core_news_lg at main

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

And this entire config got pretraining started, but I don't see any references to en_vectors_web_lg. Am I pretraining the right thing here? Here's the full config:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
raw_text = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [5000,1000,2500,2500,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
limit = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
tag_acc = 0.33
dep_uas = 0.17
dep_las = 0.17
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.33
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = null
n_save_epoch = null
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"

[pretraining.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
get_length = null

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I ran it like this:

python -m spacy pretrain my-gpu.cfg ./output --paths.raw_text all.jsonl --gpu-id 0
ℹ Using GPU: 0                                                                                              
ℹ Loading config from: my-gpu.cfg             
✔ Created output directory: output                                                                          
✔ Saved config file in the output directory     
                                                                                                            
============== Pre-training tok2vec layer - starting at epoch 0 ==============                              
  #      # Words   Total Loss     Loss    w/s                                                                                                                                                                            
  0        11941   47274.0996    47274   39583                                                              
  0        23838   91825.1826    44551   47705
  0        35807   135207.554    43382   48454
  0        47786   177913.052    42705   49521
  0        59767   219397.118    41484   35281
  0        71717   260061.124    40664   51929
  0        83650   299740.152    39679   52174

hi @ysz!

Thanks for your questions. Since your problems are more about spaCy than Prodigy, could you post your issue on the spaCy GitHub discussions forum?

The spaCy core team monitors that forum, not this one, and they have much more expertise in handling GPUs and pretraining.

The problem here is that since you're using en_core_web_trf, it doesn't use the spaCy vectors; instead, the transformer serves that purpose. That's why, if you're using en_core_web_trf, you wouldn't reference en_vectors_web_lg.
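
As a quick sanity check (a sketch, assuming both en_core_web_trf and en_core_web_lg are installed), you can see that a transformer pipeline ships without a static vectors table, while a vectors-based pipeline has one:

import spacy

# transformer pipeline: no static word vectors, the transformer embeddings fill that role
nlp_trf = spacy.load("en_core_web_trf")
print(nlp_trf.vocab.vectors.shape)  # empty table, e.g. (0, 0)

# vectors-based pipeline: a large static vectors table
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.vocab.vectors.shape)  # e.g. (514157, 300)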
