Having an issue with spacy train and the custom tokenizer component we used to annotate in Prodigy

nlp.tokenizer.rules = rules
infixes = list(nlp.Defaults.infixes)
infixes.extend([r"-", r"kt", r"mt", r"\$", r":", r"€", r"£", r"¥", r"\+", r"\[", r"\]"])
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
tok2vec = nlp.add_pipe("tok2vec")
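One note on the snippet above: the extended infixes include regex metacharacters (`$`, `+`, `[`, `]`) that must be escaped before they are joined into a single pattern, otherwise compilation fails. Below is a minimal, spaCy-free sketch of how an infix pattern splits a token like "100mt"; `split_on_infixes` is a simplified stand-in for what spaCy's tokenizer does with `infix_finditer`, not its actual implementation:

```python
import re

# A reduced version of the infix list above, with the regex
# metacharacters ($, +, [, ]) escaped so the joined pattern compiles.
infixes = [r"-", r"kt", r"mt", r"\$", r"\+", r"\[", r"\]"]
infix_re = re.compile("|".join(infixes))

def split_on_infixes(token: str) -> list[str]:
    # Simplified stand-in for spaCy's tokenizer: split the token at
    # every infix match, keeping the matched infix as its own piece.
    parts, last = [], 0
    for m in infix_re.finditer(token):
        if m.start() > last:
            parts.append(token[last:m.start()])
        parts.append(m.group())
        last = m.end()
    if last < len(token):
        parts.append(token[last:])
    return parts

print(split_on_infixes("100mt"))  # ['100', 'mt']
```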

1. I added some special cases to the tokenizer and generated a base model so that "100mt" is annotated as "100" and "mt" in Prodigy.

2. Used data-to-spacy with the base model generated above to get the .spacy files.

3. Finally, trained the model with the spacy train --code command using the same custom tokenizer component, but the results don't seem right.

Can you help me figure out whether I'm going in the right direction or not?

Welcome to the forum @Bhargavi1144 :wave:

What I think is happening is that spaCy was not explicitly instructed to use the custom tokenizer.
Unfortunately, sourcing the tokenizer from the base model is not automated (yet), so we need to tell spaCy where to source the tokenizer from via the config file.
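For reference, the sourcing usually comes down to an `[initialize.before_init]` block in the training config, along the lines of the sketch below (the paths here are placeholders for your own base model directory):

```ini
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "path/to/base_model"
vocab = "path/to/base_model"
```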

This thread provides step-by-step instructions on how to properly source a custom tokenizer for training:

Let me know if you need extra support on top of that tutorial :slight_smile:

Hello, thanks for the quick response!

I tried the same steps as mentioned in the documentation above, and I was able to generate the config.cfg and training sets using the custom tokenizer.

But when I try to run the spacy train command, it gives me the exception below:

I also don't see any messages about copying the custom tokenizer under the "Initializing pipeline" section.

Can you help me with this step if I'm missing anything here?

Hi @Bhargavi1144 ,

The error you are getting is likely related to the component not being correctly initialized.
And, indeed, in the "Initializing pipeline" section of the console log, there should be information on copying the tokenizer from a pipeline.

Could you share the config file that you use for the train command?

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"

[components.tok2vec]
source = "assets/base_model_new"

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0

[corpora.ner]
@readers = "prodigy.NERCorpus.v1"
datasets = ["asia_steel_rebar_volume"]
eval_datasets = []
default_fill = "outside"
incorrect_key = "incorrect_spans"

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "assets/base_model_new"
vocab = "assets/base_model_new"

The above one is the config data I have used.

assets/base_model_new --> This is the model I generated using the custom tokenization, and I used the same model while generating the .spacy files.

Hi @Bhargavi1144,

There seem to be differences in how "spacy.copy_from_base_model.v1" works between spaCy v3.6 and v3.7; let me check with the spaCy team and get back to you.

Looking at the config, you are on spaCy 3.7, right?

Hi @Bhargavi1144 ,

While we are looking into the copy_from_base_model, perhaps you could try downgrading spaCy to 3.6 as a workaround. The sourcing should work with this version.

Hello again @Bhargavi1144 ,

I've looked into the issue with the help of the spaCy team, and my first intuition was not confirmed: the sourcing works correctly in 3.7.x.
I also tried your config, substituting the base model and the dataset, and it runs without problems.
The issue is that your tok2vec is not initialized. Looking back at your original snippet:

tok2vec = nlp.add_pipe("tok2vec")

you can see that you are actually adding tok2vec and saving it directly, without initializing it. And then in the train cfg you source from this model:

source = "assets/base_model_new"

This errors out with that, admittedly, uninformative error because tok2vec is, indeed, not initialized.
To use a blank tok2vec you should pass a factory instead, so:

factory = "tok2vec"

Could you try training with the tok2vec factory instead of the sourced tok2vec?
Also, for the logging message about copying the tokenizer to appear, you should run spacy train with the --verbose flag. Sorry I missed that in the original post!
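In config terms, the change would look roughly like this (a sketch: the sourced [components.tok2vec] block is replaced with a factory block, letting spaCy fill in the default tok2vec model settings):

```ini
[components.tok2vec]
factory = "tok2vec"
```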


Thanks for checking the issue!

Instead of the copy_from_base_model approach, I tried the approach below.

I set the following property in config.cfg:
@callbacks = "customize_tokenizer"

Train command I used while training:
python -m spacy train assets/config.cfg --paths.train train.spacy --paths.dev dev.spacy --output model --code assets/modify_tokenizer.py
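For anyone landing here later, a simplified sketch of what such a modify_tokenizer.py registered-callback file can look like (the infix list here is taken from the snippet at the top of the thread, with the regex metacharacters escaped; adapt it to your own rules):

```python
# modify_tokenizer.py -- sketch of a --code file that registers a
# "customize_tokenizer" callback referenced from config.cfg.
import spacy
from spacy.util import registry, compile_infix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Extend the default infix patterns so units like "100mt"
        # are split into "100" and "mt" during tokenization.
        infixes = list(nlp.Defaults.infixes)
        infixes.extend([r"-", r"kt", r"mt", r"\$", r":", r"\+", r"\[", r"\]"])
        infix_re = compile_infix_regex(infixes)
        nlp.tokenizer.infix_finditer = infix_re.finditer
    return customize_tokenizer
```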

This resolved the issue, and the model was generated with the custom tokenization.

But I will also try the copy_from_base_model approach with your suggestion.

Thanks a lot for checking the issue so quickly :slight_smile:
