Training after annotating with custom tokenizer

Hi @AakankshaP ,

That's right, the tok2vec is not being trained because no component further down the pipeline uses its predictions, so no loss is backpropagated to it. If you followed the steps we discussed earlier, the NER component (as specified in the config you used) has its own, internal tok2vec, so it doesn't use the one at the beginning of the pipeline.
In fact, this first tok2vec should be frozen, just like the other components of en_core_web_sm. It shouldn't change the performance in your example, but it would be the more correct way to do it.
So there's one more modification to the training config that I missed in my original instructions: list tok2vec under frozen_components:

...
[training]
frozen_components = ["tok2vec"]
...
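For completeness, here is a rough sketch of how that section could look if you also freeze the other components sourced from en_core_web_sm (the component names below assume the standard en_core_web_sm pipeline; adjust the list to the components you actually sourced):

```ini
[training]
# freeze the shared tok2vec along with the other sourced components;
# only ner (with its internal tok2vec) keeps training
frozen_components = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]
```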

Just to explain a bit more:
There are actually two ways to use the tok2vec (embedding) layer: the components can share the same tok2vec layer, or each can be completely independent with its own internal tok2vec layer (which is the default setup if you generate the base config with the `spacy init config` command, as you observed in your training).
Each setup has its own advantages and disadvantages and they are very nicely explained in this spaCy doc: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
There you can also find information on how to set up a shared or an independent embedding layer in the config.
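To illustrate the difference described in that doc, here is a minimal sketch of the two setups (the architecture names are the standard spaCy ones; the fragments are incomplete on purpose and the two options are mutually exclusive, so pick one):

```ini
# Option A - independent (the default from `spacy init config`):
# the ner component embeds its tokens with its own internal layer.
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

# Option B - shared (use instead of Option A, not together with it):
# a standalone tok2vec component runs first and ner listens to its output.
# [components.tok2vec]
# factory = "tok2vec"
#
# [components.ner.model.tok2vec]
# @architectures = "spacy.Tok2VecListener.v1"
# width = ${components.tok2vec.model.encode.width}
```

With the shared setup, the first tok2vec does get gradients from the listening components, which is exactly why it must not appear in frozen_components in that case.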

For en_core_web_trf, please follow this script to generate the initial config (which you can then modify with your custom tokenizer): https://github.com/explosion/projects/blob/e24a085669b4db6918ffeb2752846089d8dee57a/pipelines/ner_demo_update/scripts/create_config.py
This comes from an example project that you can reuse, but there's also more generic documentation on creating a config for transformer training here: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
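As a rough idea of what the generated config would contain after your tokenizer change (the tokenizer registry name below is hypothetical, standing in for whatever name you registered your custom tokenizer under; the linked script produces the rest):

```ini
[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[nlp.tokenizer]
# hypothetical name - replace with your registered custom tokenizer
@tokenizers = "my_custom_tokenizer.v1"

# reuse the pretrained weights from the packaged pipeline
[components.transformer]
source = "en_core_web_trf"

[components.ner]
source = "en_core_web_trf"
```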