Having an issue with spacy train and the custom tokenizer component we used to annotate in Prodigy

import spacy

nlp = spacy.blank("en")  # assumption: a blank English pipeline; adapt if you start from another model
nlp.tokenizer.rules = rules  # `rules` holds the custom special-case rules, defined elsewhere

infixes = list(nlp.Defaults.infixes)
# regex metacharacters ($, +, [, ]) need escaping to be matched literally
infixes.extend([r"-", r"kt", r"mt", r"\$", r":", r"\€", r"\£", r"\¥", r"\+", r"\[", r"\]"])
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

print(nlp.pipe_names)
tok2vec = nlp.add_pipe("tok2vec")
nlp.to_disk("assets/base_model_new")
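
A quick sanity check of the saved model (a sketch; "100mt" is the example from step 1 below, and the expected output reflects the intended behavior):

nlp_check = spacy.load("assets/base_model_new")
print([t.text for t in nlp_check("100mt")])  # expected: ['100', 'mt']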

1. I added some special cases to the tokenizer and generated a base model so that "100mt" is annotated as "100" and "mt" in Prodigy.

2. Used data-to-spacy with the base model generated above to get the .spacy files.

3. Finally, trained the model using spacy train --code with the same custom tokenizer component, but the results don't seem right (rough commands for steps 2 and 3 are sketched below).
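
Roughly, the commands for steps 2 and 3 looked like the following (a sketch; the dataset name, paths, and --code module are the ones I use elsewhere in this thread, your values may differ):

prodigy data-to-spacy ./corpus --ner asia_steel_rebar_volume --base-model assets/base_model_new
python -m spacy train ./corpus/config.cfg --output model --code assets/modify_tokenizer.py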

Can you help me figure out whether I'm going in the right direction or not?

Welcome to the forum @Bhargavi1144 :wave:

What I think is happening is that spaCy was not explicitly instructed to use the custom tokenizer.
Unfortunately, sourcing the tokenizer from the base model is not automated (yet), so we need to tell spaCy where to source the tokenizer from via the config file.
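
In short, the relevant part of the training config looks like this (a minimal sketch; replace the path with your own base model):

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "path/to/base_model"
vocab = "path/to/base_model"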

This thread provides step-by-step instructions on how to properly source a custom tokenizer for training:

Let me know if you need extra support on top of that tutorial :slight_smile:

Hello, thanks for the quick response!

I tried the same steps as mentioned in the documentation above.
I was able to generate the config.cfg and the training sets using the custom tokenizer.

But when I tried to run the spacy train command, it gave me the exception below:

I also don't see any messages about copying the custom tokenizer under the "Initializing pipeline" section.

Can you help me with this step in case I'm missing anything here?

Hi @Bhargavi1144 ,

The error you are getting is likely related to the component not being initialized correctly.
And indeed, in the "Initializing pipeline" section of the console log there should be information about copying the tokenizer from a pipeline.

Could you share the config file that you use for the train command?

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"

[components.tok2vec]
source = "assets/base_model_new"

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0

[corpora.ner]
@readers = "prodigy.NERCorpus.v1"
datasets = ["asia_steel_rebar_volume"]
eval_datasets = []
default_fill = "outside"
incorrect_key = "incorrect_spans"

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.components]

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "assets/base_model_new"
vocab = "assets/base_model_new"

[initialize.tokenizer]

+++++++++++++++++++++++++++++++++

The above is the config I used.

assets/base_model_new --> this is the model I generated using the custom tokenization, and I used the same model while generating the .spacy files as well.

Hi @Bhargavi1144,

There seem to be differences in how "spacy.copy_from_base_model.v1" works between spaCy v3.6 and v3.7; let me check with the spaCy team and get back to you.

Looking at the config, you are on spaCy 3.7, right?

Hi @Bhargavi1144 ,

While we are looking into the copy_from_base_model, perhaps you could try downgrading spaCy to 3.6 as a workaround. The sourcing should work with this version.
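
For example (assuming a pip-based setup):

pip install "spacy>=3.6.0,<3.7.0"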

Hello again @Bhargavi1144 ,

I've looked into the issue with the help of the spaCy team, and my first intuition was not confirmed: the sourcing works correctly in 3.7.x.
I also tried your config, substituting my own base model and dataset, and it runs without problems.
The issue is that your tok2vec is not initialized. Looking back at your original snippet:

tok2vec = nlp.add_pipe("tok2vec")
nlp.to_disk("assets/base_model_new")

you can see that you are actually adding tok2vec and saving it right away, without initializing it. Then, in the training config, you source from this model:

[components.tok2vec]
source = "assets/base_model_new"

This errors out with that, admittedly, uninformative error because tok2vec is, indeed, not initialized.
To use a blank tok2vec, you should pass a factory instead:

[components.tok2vec]
factory = "tok2vec"

Could you try training with the tok2vec factory instead of the sourced tok2vec?
Also, for the logging message about copying the tokenizer to appear, you need to run spacy train with the --verbose flag; sorry I missed that in the original post!
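
As an aside: if you did want to keep sourcing tok2vec from a base model, one option (a minimal sketch; whether it fits your workflow is an assumption on my part) is to initialize the blank pipeline before saving it:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")
nlp.initialize()  # initializes the blank tok2vec so it can be sourced later
nlp.to_disk("assets/base_model_new")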

Hey,

Thanks for checking the issue!

Instead of the copy_from_base_model approach, I tried the approach below.

I set the following property in config.cfg:
[initialize.before_init]
@callbacks = "customize_tokenizer"

The train command I used for training:
python -m spacy train assets/config.cfg --paths.train train.spacy --paths.dev dev.spacy --output model --code assets/modify_tokenizer.py
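
For anyone following along, my assets/modify_tokenizer.py registers the callback along these lines (sketched here; the infix list mirrors my first snippet, your own file may differ):

from spacy.util import registry, compile_infix_regex

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # reapply the custom infix rules to the pipeline's tokenizer at init time
        infixes = list(nlp.Defaults.infixes)
        infixes.extend([r"-", r"kt", r"mt", r"\$", r":", r"\€", r"\£", r"\¥", r"\+", r"\[", r"\]"])
        infix_re = compile_infix_regex(infixes)
        nlp.tokenizer.infix_finditer = infix_re.finditer
    return customize_tokenizer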

This resolved the issue, and the model was generated based on the custom tokenization.

But I will also try the copy_from_base_model approach with your suggestion.

Thanks a lot for checking the issue so quickly :slight_smile:
