No component 'tok2vec' error when trying to improve a textcat multilabel model

toadle · July 23, 2024, 3:22pm

Hey everybody,

I'm training a model to categorize bank-turnovers with multiple labels.
I have a dataset already imported into prodigy and can train the model successfully like this:

prodigy train models --textcat-multilabel bank_turnovers --lang "de"

The model works and I can see results when loading it and making predictions.

Now I tried to finetune it using textcat.teach like this:

prodigy textcat.teach bank_turnovers_teach models/model-best ./assets/bank_turnovers_annotated.jsonl --label <all my labels>

This also worked and I went through roughly 400 examples.
Now I'm trying to integrate those examples into my existing model like this:

prodigy train --base-model models/model-best --textcat-multilabel bank_turnovers_teach

But this yields:

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 291, in train
    train_config = prodigy_config(
                   ^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 118, in prodigy_config
    return _prodigy_config(
           ^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 145, in _prodigy_config
    config = generate_config(config, base_nlp, base_model, list(pipes), silent)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 695, in generate_config
    tok2vec = base_nlp.get_pipe("tok2vec")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/spacy/language.py", line 650, in get_pipe
    raise KeyError(Errors.E001.format(name=name, opts=self.component_names))
KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['textcat_multilabel']"

I've looked at the config of model-best: there is indeed no tok2vec in there, but I don't know why it even tries to load it or what I am doing wrong.

Can somebody help?

magdaaniol · July 23, 2024, 7:52pm

Welcome to the forum @toadle!

You've done everything right - unfortunately, there's a small bug in how Prodigy generates the training config for spaCy's default textcat component.

The default spaCy textcat uses only bag-of-words features rather than a full-fledged embedder (no tok2vec component is required), which makes it very efficient at the cost of some accuracy. (For that reason, when improving on your baseline, you might also want to experiment with different architectures, e.g. spacy.TextCatEnsemble.v2; see Model Architectures · spaCy API Documentation.)
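
To make the bag-of-words idea concrete, here is a toy sketch in plain Python. It is not spaCy's implementation (spaCy's model hashes n-gram features and learns the weights during training); the labels, weights and helper names below are made up for illustration. It only shows that each label gets its own independent score computed from word counts, with no embedding layer involved.

```python
import math
from collections import Counter


def bow_features(text):
    """Count lowercase whitespace tokens; word order is discarded."""
    return Counter(text.lower().split())


def score_labels(text, weights):
    """Return one independent sigmoid score per label (multilabel setup)."""
    feats = bow_features(text)
    scores = {}
    for label, word_weights in weights.items():
        # Weighted sum of word counts for this label
        z = sum(word_weights.get(tok, 0.0) * n for tok, n in feats.items())
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    return scores


# Hypothetical hand-set weights, purely for illustration
weights = {
    "RENT": {"miete": 2.0},
    "SALARY": {"gehalt": 2.0},
}
scores = score_labels("Miete Januar", weights)
```

Because the scores are independent, a text can end up with several labels above the threshold, or none at all.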

Prodigy is putting tok2vec on the pipeline list even though it's not required, and we'll fix it asap.
In the meantime, you could split your finetuning workflow into 3 steps:

1. Generate the training config:

python -m prodigy spacy-config train_v2.cfg --textcat-multilabel bank_turnovers_teach --base-model models/model-best

This will generate the train_v2.cfg file in your current working directory; this is the new training config file.

2. Fix the config manually by removing tok2vec from the pipeline setting in the [nlp] section:
On line 13 it will say: pipeline = ["tok2vec","textcat_multilabel"]
while it should be: pipeline = ["textcat_multilabel"]

3. Train with --config set to the manually edited config:

python -m prodigy train ./modelsv2 --textcat-multilabel bank_turnovers_teach --config train_v2.cfg
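
If you'd rather script step 2 than edit the file by hand, something like the following should work. It's a minimal sketch using only the Python standard library: the drop_component helper is hypothetical (not part of Prodigy or spaCy), and it assumes the pipeline setting sits on a single line in the [nlp] section, as in the generated config above. It changes only that one line and leaves the rest of the file untouched.

```python
import json
import re
from pathlib import Path

# Matches a single-line setting such as: pipeline = ["tok2vec","textcat_multilabel"]
PIPELINE_RE = re.compile(r"^(\s*pipeline\s*=\s*)(\[.*\])\s*$")


def drop_component(config_text, name="tok2vec"):
    """Return config_text with `name` removed from the [nlp] pipeline list."""
    out = []
    in_nlp = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Section header: track whether we are inside [nlp]
            in_nlp = stripped == "[nlp]"
        match = PIPELINE_RE.match(line) if in_nlp else None
        if match:
            # The list value is valid JSON, so parse it instead of string-hacking
            names = [n for n in json.loads(match.group(2)) if n != name]
            line = match.group(1) + json.dumps(names, separators=(",", ":"))
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    path = Path("train_v2.cfg")  # the file generated in step 1
    if path.exists():
        fixed = drop_component(path.read_text(encoding="utf8"))
        path.write_text(fixed, encoding="utf8")
```

Run it from the directory containing train_v2.cfg, then continue with step 3.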

Sorry about this inconvenience - we should be able to ship a patch soon!

magdaaniol · July 30, 2024, 10:54am

Hi, just wanted to let you know that we have now released Prodigy 1.15.7, which fixes the issue: Changelog · Prodigy · An annotation tool for AI, Machine Learning & NLP

toadle · July 30, 2024, 12:22pm

Thanks @magdaaniol for getting this done so fast.
I'll hopefully get around to testing this out soon.
