Training after annotating with custom tokenizer

Hi @AakankshaP,

Apologies for the late reply!

Assuming you have already annotated your dataset with your custom tokenizer, the next step would be to generate a config file so that we can control the tokenizer used for data generation and training.

Prodigy provides a utility for generating a base config, and we'll use it as a starting point.
So, assuming your annotated dataset is named custom_tok_annotations and your custom pipeline is stored in the custom_tokenizer directory, you can obtain the base config by running:

python -m prodigy spacy-config custom_tok.cfg --ner custom_tok_annotations --base-model custom_tokenizer

The sourcing of the custom tokenizer from the base model is not automated, though, so before generating the training data we'll need to provide this instruction in the config.
If you inspect the generated config (custom_tok.cfg), you'll see that it is organized into sections relating to different parts of the pipeline; spaCy has excellent documentation explaining the details. For our immediate purpose, we need to modify the [initialize] section by adding an [initialize.before_init] section that specifies where spaCy should source the tokenizer from before initializing the pipeline.
In the generated custom_tok.cfg, you would add the following:

# Inside your .cfg file
...
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "custom_tokenizer"
vocab = "custom_tokenizer"
...

You'll also have to delete before_init = null from the [initialize] section so that it doesn't conflict with your new setting. The complete, edited [initialize] and [initialize.before_init] sections will look like this:

# Inside your .cfg file
...
[initialize]

vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "custom_tokenizer"
vocab = "custom_tokenizer"
...

Note that for this to work, the directory with your custom tokenizer (custom_tokenizer in the example) should be in your current working directory, or the pipeline should be installed as a package with this name.
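As a quick sanity check, you can confirm that spaCy can load the pipeline (and its custom tokenizer) from that directory before generating the data. A minimal sketch, assuming you run it from the directory containing custom_tokenizer:

import spacy

# This should load without errors and report your custom tokenizer class.
nlp = spacy.load("custom_tokenizer")
print(type(nlp.tokenizer))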

With the edited custom_tok.cfg, we can now proceed to generate the train and dev datasets with data-to-spacy
(assuming, as before, that the annotated dataset is named custom_tok_annotations, the task is ner, and the custom pipeline is in the custom_tokenizer directory):

python -m prodigy data-to-spacy custom_tok_output --ner custom_tok_annotations --config custom_tok.cfg --base-model custom_tokenizer

This will also generate a training config in the same directory as the data, so you can proceed with training by running:

python -m spacy train custom_tok_output/config.cfg --paths.train custom_tok_output/train.spacy --paths.dev custom_tok_output/dev.spacy -o . --verbose

When training, you should see that the custom tokenizer is being copied:

=========================== Initializing pipeline ===========================
[2023-10-20 17:31:26,122] [INFO] Set up nlp object from config
[2023-10-20 17:31:26,132] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2023-10-20 17:31:26,132] [INFO] Resuming training for: ['ner', 'tok2vec']
[2023-10-20 17:31:26,139] [INFO] Copying tokenizer from: custom_tokenizer
[2023-10-20 17:31:26,293] [INFO] Copying vocab from: custom_tokenizer

You could also test it by training on a mini dataset with some telling examples, such as:

{"text": "I'm in class 3A"}
{"text": "An ex.ample with a dot"}
...
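To turn these raw examples into an annotated mini dataset, you could run them through ner.manual with your custom pipeline, so the annotation uses the same tokenization (the dataset name mini_tok_test, the file name mini_examples.jsonl, and the label MY_LABEL below are all hypothetical placeholders):

python -m prodigy ner.manual mini_tok_test custom_tokenizer mini_examples.jsonl --label MY_LABEL

and then repeat the data-to-spacy and train steps above on this dataset.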

After training, you can check that the resulting pipeline tokenizes as expected:

import spacy

nlp = spacy.load("model-best")
doc = nlp("I'm in class 3A")
tokens = [token for token in doc]
print(tokens)
# [I'm, in, class, 3, A]
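For comparison, you could run the same text through a blank English pipeline, which falls back to spaCy's default English tokenizer; the output should differ from your custom tokenization (a quick sketch, and the exact default output depends on your spaCy version):

import spacy

# A blank pipeline uses the default English tokenizer, which, for example,
# does not split "3A" into "3" and "A".
nlp_default = spacy.blank("en")
print([token.text for token in nlp_default("I'm in class 3A")])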

Hope that gets you started; let me know how it goes! And again, sorry for the late reply.
