Training after annotating with custom tokenizer

Hi, I used a custom_tokenizer to annotate the data, and I have saved the model using nlp.to_disk:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = compile_infix_regex(nlp.Defaults.infixes + [r'(?<=[0-9])(?=[a-zA-Z])', r'\.'])
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)


Here are the details of the base model used for labelling. I want to confirm the steps for running the training; I'm new to spaCy and would need some help with the config file.
Also, while using data-to-spacy, how do I make sure my annotations are being saved correctly? I have read a couple of issues regarding data-to-spacy and custom_tokenizer, and there are some steps to be taken while saving the annotations.

Hi @AakankshaP,

Apologies for the late reply!

Assuming you have already annotated your dataset with your custom tokenizer, the next step would be to generate a config file so that we can control the tokenizer used for data generation and the training.

Prodigy provides a utility for generating the base config and we'll use it as a starting point.
So, assuming your annotated dataset is named custom_tok_annotations and that your custom pipeline is stored in custom_tokenizer directory, you can obtain the base config by running:

python -m prodigy spacy-config custom_tok.cfg --ner custom_tok_annotations --base-model custom_tokenizer

The sourcing of the custom tokenizer from the base model is not automated, though, so before generating the training data we'll need to provide that instruction in the config.
If you inspect the generated config (custom_tok.cfg), you'll see that it is organized into sections relating to different parts of the pipeline, and spaCy has excellent documentation explaining the details. For our immediate purpose, we need to modify the [initialize] section by adding an [initialize.before_init] section that specifies where spaCy should source the tokenizer from before initializing the pipeline.
In the generated custom_tok.cfg, you would add the following:

# Inside your .cfg file
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "custom_tokenizer"
vocab = "custom_tokenizer"

You'll also have to delete before_init = null from the [initialize] section so that it doesn't conflict with your new setting. So the complete, edited [initialize] and [initialize.before_init] sections will look like so:

# Inside your .cfg file

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "custom_tokenizer"
vocab = "custom_tokenizer"

Note that for this to work, the directory with your custom tokenizer (custom_tokenizer in the example) should be in the current working directory, or the pipeline should be installed as a package with this name.
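For example, here is a minimal sketch of producing that directory. It uses spacy.blank("en") so it runs without any model download; with your real base model you would load 'en_core_web_sm' instead, as in your original snippet:

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Same custom rules as above: split between a digit and a letter, and on dots
    infix_re = compile_infix_regex(nlp.Defaults.infixes + [r'(?<=[0-9])(?=[a-zA-Z])', r'\.'])
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

# Blank pipeline for illustration only; use your annotated base model in practice
nlp = spacy.blank("en")
nlp.tokenizer = custom_tokenizer(nlp)

# Write the pipeline to ./custom_tokenizer so the config can source it from there
nlp.to_disk("custom_tokenizer")
```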

With the edited custom_tok.cfg, we can now proceed to generate the train and dev datasets with data-to-spacy
(assuming the dataset is named custom_tok, the task is ner, and the custom pipeline is in the custom_tokenizer directory):

python -m prodigy data-to-spacy custom_tok_output --ner custom_tok --config custom_tok.cfg --base-model custom_tokenizer

This will also generate a training config in the same directory as the data, so you can proceed with training by running:

 python -m spacy train custom_tok_output/config.cfg --paths.train custom_tok_output/train.spacy custom_tok_output/dev.spacy -o . --verbose

When training, you should see that the custom tokenizer is being copied:

=========================== Initializing pipeline ===========================
[2023-10-20 17:31:26,122] [INFO] Set up nlp object from config
[2023-10-20 17:31:26,132] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2023-10-20 17:31:26,132] [INFO] Resuming training for: ['ner', 'tok2vec']
[2023-10-20 17:31:26,139] [INFO] Copying tokenizer from: custom_tokenizer
[2023-10-20 17:31:26,293] [INFO] Copying vocab from: custom_tokenizer

You could also test it by training on a mini dataset with some telling examples such as:

{"text": "I'm in class 3A"}
{"text": "An ex.ample with a dot"}

and test that the resultant pipeline tokenizes as expected:

import spacy

nlp = spacy.load('model-best')
doc = nlp("I'm in class 3A")
tokens = [token for token in doc]
print(tokens)
# [I'm, in, class, 3, A]
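If you don't want to wait for a training run, you can also sanity-check the splitting rules directly on a blank pipeline. This is just a sketch reusing the tokenizer function from the beginning of the thread; no trained components are needed:

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Custom rules from the question: split between a digit and a letter, and on dots
    infix_re = compile_infix_regex(nlp.Defaults.infixes + [r'(?<=[0-9])(?=[a-zA-Z])', r'\.'])
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

nlp = spacy.blank("en")  # illustration only
nlp.tokenizer = custom_tokenizer(nlp)

print([t.text for t in nlp("An ex.ample with a dot")])
# ['An', 'ex', '.', 'ample', 'with', 'a', 'dot']
```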

Hope that can get you started, let me know how it goes! And again, sorry for the late reply.


Hi @magdaaniol, thank you so much! I tried the setup and it works perfectly. After setting up the training, I see that the loss of tok2vec is 0.00 throughout the epochs. What I understand from this is that the tok2vec layer is not being trained, unlike when a blank model is used. Is it because I'm using 'en_core_web_sm'? I'm having a hard time understanding the architecture. Can you please shed some light on it?

I will be using transformer models to replace the tok2vec layer in the next steps, and I'm having a hard time understanding what the loss exactly means here.

Thanks again for the detailed documentation on incorporating the custom tokenizer!

Hi @AakankshaP ,

That's right, tok2vec is not being trained because none of the components downstream in the pipeline use its predictions, so there's no backpropagation of the loss. If you followed the steps we discussed earlier, the NER component (as specified in the config you used) has its own, internal tok2vec, so it doesn't use the one at the beginning of the pipeline.
In fact, this first tok2vec should be frozen just like the other components of en_core_web_sm. It shouldn't change the performance in your example, but that would be a more correct way to do it.
So there's one more modification to the training config that I missed in my original instructions: list tok2vec under frozen_components in the [training] section:

"frozen_components": ["tok2vec"]

Just to explain a bit more:
There are actually two ways to use the tok2vec (embedding) layer: you can make the components share the same tok2vec layer, or keep them completely independent, each with its own internal tok2vec layer (the default setup if you generate the base config using the spacy-config command, as you could observe in your training).
Each setup has its own advantages and disadvantages, and they are very nicely explained in this spaCy doc: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
You can also find information there on how to set up a shared or independent embedding layer in the config.
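For illustration, here is a sketch of what the shared setup could look like in the config. The component and variable names follow the defaults described in the spaCy docs and are assumptions, not your exact config: the NER component's internal tok2vec model is replaced by a Tok2VecListener that connects to the top-level tok2vec component:

```ini
# Sketch of a shared embedding layer (see the spaCy docs above for the full config)
[components.tok2vec]
factory = "tok2vec"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "tok2vec"
```

With this setup, the ner component listens to the shared tok2vec, so its loss does backpropagate into the embedding layer during training.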

For en_core_web_trf, please follow this script to generate the initial config (which you can then modify with your custom tokenizer).
It comes from an example project that you can reuse, but there's also more generic documentation on creating a config for transformer training here: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation