Training with base model en_core_web_trf throws error

Training an NER model using the transformer model as the base model throws an error (see below):

python -m prodigy train ./output_dir --ner ner_ticker --base-model en_core_web_trf

Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Program Files\Python\Lib\site-packages\prodigy\__main__.py", line 50, in <module>
main()
File "C:\Program Files\Python\Lib\site-packages\prodigy\__main__.py", line 44, in main
controller = run_recipe(run_args)
^^^^^^^^^^^^^^^^^^^^
File "cython_src\prodigy\cli.pyx", line 117, in prodigy.cli.run_recipe
File "cython_src\prodigy\cli.pyx", line 118, in prodigy.cli.run_recipe
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 291, in train
train_config = prodigy_config(
^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 118, in prodigy_config
return _prodigy_config(
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 145, in _prodigy_config
config = generate_config(config, base_nlp, base_model, list(pipes), silent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\prodigy\recipes\train.py", line 695, in generate_config
tok2vec = base_nlp.get_pipe("tok2vec")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python\Lib\site-packages\spacy\language.py", line 650, in get_pipe
raise KeyError(Errors.E001.format(name=name, opts=self.component_names))
KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"

My spaCy info:

============================== Info about spaCy ==============================

spaCy version 3.7.2
Location C:\Program Files\Python\Lib\site-packages\spacy
Platform Windows-10-10.0.22621-SP0
Python version 3.11.6
Pipelines en_core_web_lg (3.7.0), en_core_web_md (3.7.0), en_core_web_sm (3.7.0), en_core_web_trf (3.7.2)

Thanks for your help.

Also, forgot to print my Prodigy stats:

============================== :sparkles: Prodigy Stats ==============================

Version 1.14.6
Location C:\Program Files\Python\Lib\site-packages\prodigy
Prodigy Home C:\Users\Ronny.prodigy
Platform Windows-10-10.0.22621-SP0
Python Version 3.11.6
Spacy Version 3.7.2
Database Name SQLite
Database Id sqlite
Total Datasets 1
Total Sessions 31

Thanks.

hi @ronnysh!

This seems to be similar to this issue. Can you see if this solution works?

Hi @ryanwesslen ,

As I am using Prodigy for training I don't manually create a config file. Is there a way to bypass the tok2vec component when training with Prodigy using the transformer model?

Thanks.

We're working on a fix but as of now, the only option is the workaround in that post using a config file. I'll post back when we have an update on our progress.

1 Like

Any update on this issue?
Using:

Prodigy Ver  1.14.12                       
Python Ver   3.10.12                       
spaCy Ver    3.7.2 

I'm getting the same error (KeyError: "[E001] No component 'tok2vec' found in pipeline...") when trying to train custom NER labels (like Passport Num, Phone, etc.):

python3 -m prodigy train --ner ner_final2 ./ner_final2_600_basemodel --config ./config.cfg --gpu-id 0 --eval 0.1 --label-stats --base-model en_core_web_trf

The minimal change in the config is shown below:

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","english_ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.english_ner]
factory = "ner"

Any thoughts would be highly useful. Thanks :grinning:

Cheers!
Chandra

Hi @e101sg,

This runtime KeyError during training of transformer-based pipelines was fixed in Prodigy 1.15.1.
As of that version, you should be able to specify transformer as the embedding layer in your config. That's what you should use (instead of tok2vec) if your base model is a trf-based model.
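To make that concrete, here is a minimal, partial sketch of what the transformer-based embedding layer looks like in a config. This is not a complete training config, just the relevant component sections; the architecture and layer names follow spaCy's documented defaults for transformer NER pipelines, so double-check them against a config generated by spacy init config for your version:

[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

The key point is the TransformerListener under the NER model: the NER component listens to the shared transformer component for its embeddings, instead of a tok2vec component.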

1 Like

pipeline = ["tok2vec","ner","english_ner"]

hmm.. do you mean it should be like pipeline = ["transformer","ner","english_ner"]?

Here, english_ner is my custom NER model. Also, should it be trained on GPU (with --gpu-id 0) because it is transformer-based?
I also wonder what to change in config.cfg to use it with spaCy pipelines like en_core_web_sm, i.e. to create a new model = custom NER model + spaCy's en_core_web_sm pipeline.

Cheers and thanks!! :slight_smile:
e101sg

Hi @e101sg,
That's right:

pipeline = ["transformer","ner", "english_ner"]

would be one way to define a pipeline with 2 NER components.
I recommend you check out this tutorial on combining pre-trained and custom NER components in different ways and related tradeoffs.
Also, this spaCy Discussions thread is relevant if both your NER components are transformer-based: Combining Pretrained and Trained NER Components, both with Transformers · explosion/spaCy · Discussion #9784 · GitHub
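As for the en_core_web_sm part of the question, one hedged sketch of the relevant config sections, assuming spaCy 3.x's component sourcing mechanism (the component names here are illustrative, not prescribed):

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","english_ner"]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]

[components.english_ner]
factory = "ner"

Here the pretrained ner component is copied from en_core_web_sm via source, with replace_listeners so it keeps its own embedding weights instead of listening to the new shared tok2vec, while english_ner is a fresh component trained on your data. See the spaCy docs on sourcing components for the exact semantics in your version.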

1 Like