Training after annotating with custom tokenizer

Hi @AakankshaP ,

That's right, the tok2vec is not being trained because no component further down the pipeline uses its predictions, so no loss is backpropagated to it. If you followed the steps we discussed earlier, the NER component (as specified in the config you used) has its own, internal tok2vec, so it doesn't use the one at the beginning of the pipeline.
In fact, this first tok2vec should be frozen, just like the other components of en_core_web_sm. It shouldn't change the performance in your example, but it would be the more correct way to do it.
So there's one more modification to the training config that I missed in my original instructions: list tok2vec under frozen_components:

...
[training]
frozen_components = ["tok2vec"]
...
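For completeness, here is a rough sketch of how that section could look if you also freeze the other components sourced from en_core_web_sm (the component names below assume the standard en_core_web_sm pipeline; adjust the list to the components you actually sourced):

```ini
[training]
# freeze the shared tok2vec along with the other sourced components;
# only ner (with its internal tok2vec) keeps training
frozen_components = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]
```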

Just to explain a bit more:
There are actually two ways to use the tok2vec (embedding) layer: the components can share the same tok2vec layer, or each can be completely independent with its own internal tok2vec layer (which is the default setup if you generate the base config with the `spacy init config` command, as you observed in your training).
Each setup has its own advantages and disadvantages and they are very nicely explained in this spaCy doc: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
There you can also find information on how to set up a shared or an independent embedding layer in the config.
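To illustrate the difference described in that doc, here is a minimal sketch of the two setups (the architecture names are the standard spaCy ones; the fragments are incomplete on purpose and the two options are mutually exclusive, so pick one):

```ini
# Option A - independent (the default from `spacy init config`):
# the ner component embeds its tokens with its own internal layer.
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

# Option B - shared (use instead of Option A, not together with it):
# a standalone tok2vec component runs first and ner listens to its output.
# [components.tok2vec]
# factory = "tok2vec"
#
# [components.ner.model.tok2vec]
# @architectures = "spacy.Tok2VecListener.v1"
# width = ${components.tok2vec.model.encode.width}
```

With the shared setup, the first tok2vec does get gradients from the listening components, which is exactly why it must not appear in frozen_components in that case.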

For en_core_web_trf, please follow this script to generate the initial config (which you can then modify with your custom tokenizer): https://github.com/explosion/projects/blob/e24a085669b4db6918ffeb2752846089d8dee57a/pipelines/ner_demo_update/scripts/create_config.py
This comes from an example project that you can reuse, but there's also more generic documentation on creating a config for transformer training here: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
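As a rough idea of what the generated config would contain after your tokenizer change (the tokenizer registry name below is hypothetical, standing in for whatever name you registered your custom tokenizer under; the linked script produces the rest):

```ini
[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[nlp.tokenizer]
# hypothetical name - replace with your registered custom tokenizer
@tokenizers = "my_custom_tokenizer.v1"

# reuse the pretrained weights from the packaged pipeline
[components.transformer]
source = "en_core_web_trf"

[components.ner]
source = "en_core_web_trf"
```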