No component 'tok2vec' error when trying to improve a textcat multilabel model

Hey everybody,

I'm training a model to categorize bank-turnovers with multiple labels.
I have a dataset already imported into prodigy and can train the model successfully like this:

prodigy train models --textcat-multilabel bank_turnovers --lang "de"

The model works and I can see results when loading it and making predictions.

Now I tried to finetune it using textcat.teach like this:

prodigy textcat.teach bank_turnovers_teach models/model-best ./assets/bank_turnovers_annotated.jsonl --label <all my labels>

This also worked and I went through roughly 400 examples.
Now I'm trying to integrate those examples into my existing model like this:

prodigy train --base-model models/model-best --textcat-multilabel bank_turnovers_teach

But this yields:

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 291, in train
    train_config = prodigy_config(
                   ^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 118, in prodigy_config
    return _prodigy_config(
           ^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 145, in _prodigy_config
    config = generate_config(config, base_nlp, base_model, list(pipes), silent)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/prodigy/recipes/train.py", line 695, in generate_config
    tok2vec = base_nlp.get_pipe("tok2vec")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tim/.pyenv/versions/3.12.1/lib/python3.12/site-packages/spacy/language.py", line 650, in get_pipe
    raise KeyError(Errors.E001.format(name=name, opts=self.component_names))
KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['textcat_multilabel']"

I've looked at the config of the model-best: There is truly no tok2vec in there, but I don't know why it even tries to load it or what I am doing wrong.

Can somebody help?

Welcome to the forum @toadle!

You've done everything right - unfortunately, there's a small bug in how Prodigy generates the training config for spaCy's default textcat component.

The default spaCy textcat uses only bag-of-words and not a full-fledged embedder (no tok2vec component is required), which makes it very efficient at the cost of some performance. (For that reason, when improving on your baseline you might want to experiment with different architectures as well, e.g. spacy.TextCatEnsemble.v2; see the Model Architectures page of the spaCy API documentation.)

Prodigy is putting tok2vec on the pipeline list even though it's not required, and we'll fix it asap.
In the meantime, you could split your fine-tuning workflow into 3 steps:

  1. Generate the training config:

python -m prodigy spacy-config train_v2.cfg --textcat-multilabel bank_turnovers_teach --base-model models/model-best

This will generate the train_v2.cfg file in your working directory, which is the new training config file.

  2. Fix the config manually by removing tok2vec from the [nlp][pipeline] section:
    On line 13 it will say: pipeline = ["tok2vec","textcat_multilabel"]
    while it should be: pipeline = ["textcat_multilabel"]

  3. Train with --config set to the manually edited config:

python -m prodigy train ./modelsv2 --textcat-multilabel bank_turnovers_teach --config train_v2.cfg
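
If you'd rather not edit the file by hand, the manual pipeline fix could be scripted roughly like this. This is only a sketch and not part of Prodigy or spaCy: drop_component and fix_config_file are made-up helper names, and it assumes the pipeline setting sits on a single line as a JSON-style list, like the one quoted above.

```python
import json
import re
from pathlib import Path

# Matches a single-line setting such as:
#   pipeline = ["tok2vec","textcat_multilabel"]
PIPELINE_LINE = re.compile(
    r"^([ \t]*pipeline[ \t]*=[ \t]*)(\[.*\])[ \t]*$", re.MULTILINE
)


def drop_component(config_text: str, name: str = "tok2vec") -> str:
    """Return config_text with `name` removed from the pipeline list.

    Hypothetical helper: only the first `pipeline = [...]` line is touched,
    everything else in the config is left exactly as it was.
    """

    def rewrite(match: re.Match) -> str:
        components = json.loads(match.group(2))  # list of component names
        kept = [c for c in components if c != name]
        return match.group(1) + json.dumps(kept, separators=(",", ":"))

    return PIPELINE_LINE.sub(rewrite, config_text, count=1)


def fix_config_file(path: str, name: str = "tok2vec") -> None:
    """Apply drop_component to the config file at `path`, in place."""
    cfg = Path(path)
    cfg.write_text(drop_component(cfg.read_text(encoding="utf8"), name),
                   encoding="utf8")
```

Calling fix_config_file("train_v2.cfg") after the spacy-config step should leave you with pipeline = ["textcat_multilabel"]. It's still worth opening the file once to check the result before training.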

Sorry about this inconvenience - we should be able to ship a patch soon!

Hi, just wanted to update that we have now released Prodigy 1.1.5.7, which fixes the issue (see the Prodigy changelog).

Thanks @magdaaniol for getting this done so fast.
I'll hopefully get around to testing this out soon.