Training with base model en_core_web_trf throws error

Training an NER model using the transformer model as the base model throws an error (see below):

python -m prodigy train ./output_dir --ner ner_ticker --base-model en_core_web_trf

Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Program Files\Python\Lib\site-packages\prodigy\__main__.py", line 50, in <module>
main()
File "C:\Program Files\Python\Lib\site-packages\prodigy\__main__.py", line 44, in main
controller = run_recipe(run_args)
^^^^^^^^^^^^^^^^^^^^
File "cython_src\prodigy\cli.pyx", line 117, in prodigy.cli.run_recipe
File "cython_src\prodigy\cli.pyx", line 118, in prodigy.cli.run_recipe
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 291, in train
train_config = prodigy_config(
^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 118, in prodigy_config
return _prodigy_config(
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 145, in _prodigy_config
config = generate_config(config, base_nlp, base_model, list(pipes), silent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 695, in generate_config
tok2vec = base_nlp.get_pipe("tok2vec")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\spacy\language.py", line 650, in get_pipe
raise KeyError(Errors.E001.format(name=name, opts=self.component_names))
KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"

My spaCy info:

============================== Info about spaCy ==============================

spaCy version 3.7.2
Location C:\Program Files\Python\Lib\site-packages\spacy
Platform Windows-10-10.0.22621-SP0
Python version 3.11.6
Pipelines en_core_web_lg (3.7.0), en_core_web_md (3.7.0), en_core_web_sm (3.7.0), en_core_web_trf (3.7.2)

Thanks for your help.

Also, forgot to print my Prodigy stats:

============================== :sparkles: Prodigy Stats ==============================

Version 1.14.6
Location C:\Program Files\Python\Lib\site-packages\prodigy
Prodigy Home C:\Users\Ronny.prodigy
Platform Windows-10-10.0.22621-SP0
Python Version 3.11.6
Spacy Version 3.7.2
Database Name SQLite
Database Id sqlite
Total Datasets 1
Total Sessions 31

Thanks.

hi @ronnysh!

This seems to be similar to this issue. Can you see if this solution works?

Hi @ryanwesslen ,

As I am using Prodigy for training I don't manually create a config file. Is there a way to bypass the tok2vec component when training with Prodigy using the transformer model?

Thanks.

We're working on a fix but as of now, the only option is the workaround in that post using a config file. I'll post back when we have an update on our progress.

1 Like

Any update on this issue?
Using:

Prodigy Ver  1.14.12                       
Python Ver   3.10.12                       
spaCy Ver    3.7.2 

I'm getting the same error (KeyError: "[E001] No component 'tok2vec' found in pipeline...") when trying to train custom NER labels (like Passport Num, Phone, etc.):

python3 -m prodigy train --ner ner_final2 ./ner_final2_600_basemodel --config ./config.cfg --gpu-id 0 --eval 0.1 --label-stats --base-model en_core_web_trf

The minimal change in the config is shown below:

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","english_ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.english_ner]
factory = "ner"

Any thoughts would be highly useful. Thanks :grinning:

Cheers!
Chandra

Hi @e101sg,

This runtime KeyError during training of transformer-based pipelines was fixed in Prodigy 1.15.1.
As of that version, you should be able to specify transformer as the embedding layer in your config. That's what you should use (instead of tok2vec) if your base model is a trf-based model.
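To make that concrete, here is a minimal, partial sketch of what the transformer-based embedding layer looks like in a config. This is not a complete training config, just the relevant component sections; the architecture and layer names follow spaCy's documented defaults for transformer NER pipelines, so double-check them against a config generated by spacy init config for your version:

[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

The key point is the TransformerListener under the NER model: the NER component listens to the shared transformer component for its embeddings, instead of a tok2vec component.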

1 Like

pipeline = ["tok2vec","ner","english_ner"]

hmm.. do you mean it should be like pipeline = ["transformer","ner","english_ner"]?

Here, english_ner is my custom NER model. Also, should it be trained on GPU (with --gpu-id 0) because it is transformer-based?
I also wonder what to change in config.cfg to use it with spaCy pipelines like en_core_web_sm, i.e. to create a new model = custom NER model + spaCy's en_core_web_sm pipeline.

Cheers and thanks!! :slight_smile:
e101sg

Hi @e101sg,
That's right:

pipeline = ["transformer","ner", "english_ner"]

would be one way to define a pipeline with 2 NER components.
I recommend you check out this tutorial on combining pre-trained and custom NER components in different ways and related tradeoffs.
Also, this spaCy Discussions thread is relevant if both your NER components are transformer-based: Combining Pretrained and Trained NER Components, both with Transformers · explosion/spaCy · Discussion #9784 · GitHub
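As for the en_core_web_sm part of the question, one hedged sketch of the relevant config sections, assuming spaCy 3.x's component sourcing mechanism (the component names here are illustrative, not prescribed):

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","english_ner"]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]

[components.english_ner]
factory = "ner"

Here the pretrained ner component is copied from en_core_web_sm via source, with replace_listeners so it keeps its own embedding weights instead of listening to the new shared tok2vec, while english_ner is a fresh component trained on your data. See the spaCy docs on sourcing components for the exact semantics in your version.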

1 Like