Unable to train textcat model using en_core_web_md as a base model

Hello -

I'm in the process of migrating some old textcat models to be used with spaCy 3.4.1. For the updated/retrained model, my goal is to maintain approximate performance of the existing model on the same training data.

The old model used word vectors from the spaCy medium model (en_core_web_md). However, when I attempt to train the model via prodigy (v1.11.8), I see the following error:

$ prodigy train --textcat-multilabel $MY_DATASET --base-model en_core_web_md
...
=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.textcat_multilabel.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

I see the same error with en_core_web_lg. FWIW, I've also tried the transformer model en_core_web_trf. Training seems to proceed there, but I don't currently have access to a GPU for training, and we aren't running GPUs in production.

I'm aware of this issue, but I'm not quite sure if it applies here or how to apply the suggested workaround.

Any suggestions or insight? Many thanks in advance!
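
A note on the error text itself: its wording matches the InterpolationMissingOptionError raised by Python's standard configparser module, which spaCy's config system appears to build on. The sketch below reproduces that class of failure using only the standard library. It is an illustration under assumptions, not spaCy's or Prodigy's actual code path: the standard library writes references as ${section:option}, whereas spaCy configs use dots, and the helper name describe_width and the width value 96 are made up for the example.

```python
import configparser

# The referenced section exists, so the variable can be resolved.
WORKING = """
[components.tok2vec.model.encode]
width = 96

[components.textcat_multilabel.model.tok2vec]
width = ${components.tok2vec.model.encode:width}
"""

# The referenced section is absent, as the error in this thread suggests.
BROKEN = """
[components.textcat_multilabel.model.tok2vec]
width = ${components.tok2vec.model.encode:width}
"""


def describe_width(config_text):
    """Return the resolved width, or the error text if resolution fails."""
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation(),
        delimiters=("=",),
    )
    parser.read_string(config_text)
    try:
        return parser.get("components.textcat_multilabel.model.tok2vec", "width")
    except configparser.InterpolationMissingOptionError as err:
        return str(err)


if __name__ == "__main__":
    print(describe_width(WORKING))
    print(describe_width(BROKEN))
```

In other words, the message says that the textcat component's config points at a tok2vec setting that does not exist in the config being validated.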

Hi Michael,

I'll try to reproduce the steps here. Here's an examples.jsonl file.

{"text": "this text is very positive"}
{"text": "this text is very negative"}
{"text": "boo and hiss!"}
{"text": "yay and hurrah!"}
{"text": "i am neutral"}

These are annotated via:

python -m prodigy textcat.manual issue-5962 examples.jsonl --label pos,neg,neu

Next, when I run prodigy train on this data:

python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md

Then I see this intermediate output appear:

ℹ Using CPU

========================= Generating Prodigy config =========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
ℹ Auto-generating config with spaCy
config={'paths': {'train': None, 'dev': None, 'vectors': None, 'init_tok2vec': None}, 'system': {'gpu_allocator': None, 'seed': 0}, 'nlp': {'lang': 'en', 'pipeline': ['textcat_multilabel'], 'batch_size': 1000, 'disabled': [], 'before_creation': None, 'after_creation': None, 'after_pipeline_creation': None, 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}}, 'components': {'textcat_multilabel': {'factory': 'textcat_multilabel', 'model': {'@architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': False, 'ngram_size': 1, 'no_output_layer': False, 'nO': None}, 'scorer': {'@scorers': 'spacy.textcat_multilabel_scorer.v1'}, 'threshold': 0.5}}, 'corpora': {'dev': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.dev}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}, 'train': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.train}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}}, 'training': {'dev_corpus': 'corpora.dev', 'train_corpus': 'corpora.train', 'batcher': {'@batchers': 'spacy.batch_by_words.v1', 'discard_oversize': False, 'tolerance': 0.2, 'size': {'@schedules': 'compounding.v1', 'start': 100, 'stop': 1000, 'compound': 1.001, 't': 0.0}, 'get_length': None}, 'optimizer': {'@optimizers': 'Adam.v1', 'beta1': 0.9, 'beta2': 0.999, 'L2_is_weight_decay': True, 'L2': 0.01, 'grad_clip': 1.0, 'use_averages': False, 'eps': 1e-08, 'learn_rate': 0.001}, 'seed': '${system.seed}', 'gpu_allocator': '${system.gpu_allocator}', 'dropout': 0.1, 'accumulate_gradient': 1, 'patience': 1600, 'max_epochs': 0, 'max_steps': 20000, 'eval_frequency': 200, 'score_weights': {'cats_score': 1.0, 'cats_score_desc': None, 'cats_micro_p': None, 'cats_micro_r': None, 'cats_micro_f': None, 'cats_macro_p': None, 'cats_macro_r': None, 'cats_macro_f': None, 'cats_macro_auc': None, 'cats_f_per_type': None, 'cats_macro_auc_per_type': None}, 'frozen_components': [], 'annotating_components': [], 'before_to_disk': None, 'logger': {'@loggers': 
'spacy.ConsoleLogger.v1', 'progress_bar': False}}, 'pretraining': {}, 'initialize': {'vectors': '${paths.vectors}', 'init_tok2vec': '${paths.init_tok2vec}', 'vocab_data': None, 'lookups': None, 'tokenizer': {}, 'components': {}, 'before_init': None, 'after_init': None}}
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-21 14:28:21,704] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
[2022-09-21 14:28:21,717] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat_multilabel']
[2022-09-21 14:28:21,717] [INFO] Resuming training for: ['tok2vec']
[2022-09-21 14:28:21,721] [INFO] Created vocabulary
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-21 14:28:23,042] [INFO] Added vectors: en_core_web_md
[2022-09-21 14:28:23,137] [INFO] Finished initializing nlp object
[2022-09-21 14:28:23,140] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner', 'textcat_multilabel']
ℹ Frozen components: ['tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SPEED   SCORE 
---  ------  ------------  -------------  ----------  ------  ------
  0       0          0.00           0.25        0.00  385.30    0.00
1000    1000          0.00          58.68        0.00  545.31    0.00
2000    2000          0.00           8.41        0.00  578.05    0.00
3000    3000          0.00           3.04        0.00  560.96    0.00
4000    4000          0.00           1.43        0.00  571.13    0.00
5000    5000          0.00           0.76        0.00  561.90    0.00

You'll notice that I get some warnings because my installed en_core_web_md (3.3.0) is older than my spaCy version (3.4.1), but apart from that, training seems to run fine. That means I can't reproduce the error locally. You'll also notice that my run generates and prints a JSON blob with config settings. Could you share yours? I'm guessing something is going wrong there on your end.

Also, could you share your Prodigy and spaCy versions? Preferably via:

python -m pip freeze | grep spacy
python -m pip freeze | grep prodigy

Could you also verify whether the output stays the same when you run python -m prodigy instead of prodigy? I just want to make sure it's not a virtualenv thing.
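
If it helps, the same version check can be done from inside the interpreter that actually runs Prodigy, which also shows which Python executable is in use. This is a rough sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_versions is made up for the example.

```python
import sys
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The executable path reveals which environment this interpreter belongs to.
    print(sys.executable)
    for name, version in installed_versions(["spacy", "prodigy", "en-core-web-md"]).items():
        print(name, version)
```

Running it once with the interpreter behind `prodigy` and once with `python -m` should give identical output if both resolve to the same environment.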

Hi Vincent -

Thanks for the reply!

So we're on the same page, I annotated the dataset you provided, saved it under the same dataset name (issue-5962), and I see the same error when running your commands above. As for your questions:

  1. I do not see the config JSON blob when trying to train, nor do I see similar warnings as you did. I tried adding the --verbose flag but that had no effect.
$ python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.textcat_multilabel.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
  2. Prepending python -m yields the same output as the above.
  3. My versions of all the things are as follows:
(prodigy-spacy3) nelson:~ michaelcarlin$ python -m pip freeze | grep spacy
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl
en-core-web-md @ https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.0/en_core_web_md-3.4.0-py3-none-any.whl
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl
spacy==3.4.1
spacy-alignments==0.8.5
spacy-legacy==3.0.10
spacy-loggers==1.0.3
spacy-transformers==1.1.8
(prodigy-spacy3) nelson:~ michaelcarlin$ python -m pip freeze | grep prodigy
prodigy==1.11.8
(prodigy-spacy3) nelson:~ michaelcarlin$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.4.1
Location         /Users/michaelcarlin/.pyenv/versions/prodigy-spacy3/lib/python3.8/site-packages/spacy
Platform         macOS-10.16-x86_64-i386-64bit
Python version   3.8.5
Pipelines        en_core_web_md (3.4.0), en_core_web_sm (3.4.0), en_core_web_trf (3.4.0), en_core_web_lg (3.4.0)

(prodigy-spacy3) nelson:~ michaelcarlin$ python -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.8
Location         /Users/michaelcarlin/.pyenv/versions/prodigy-spacy3/lib/python3.8/site-packages/prodigy
Prodigy Home     /Users/michaelcarlin/.prodigy
Platform         macOS-10.16-x86_64-i386-64bit
Python Version   3.8.5
Database Name    PostgreSQL
Database Id      postgresql
...

FWIW, I also tried the above in a brand new virtual environment, installing only prodigy, psycopg2 (so I can interface with the data), and downloading en_core_web_md. Still seeing the same errors.

Let me know if this yields any insight on your end. Thanks!

That's strange. I don't know why my output is different, but we can try another remedy in the meantime. We could manually pass a config.cfg file. I generated a partial config via the spaCy docs, and made it complete by running:

python -m spacy init fill-config ./partial.cfg ./config.cfg 

This gave me this config.cfg file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Could you try to train with this config file? A command like this one should suffice:

python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md --config config.cfg

Hi Vincent -

Thanks! Yeah, also confused as to why we don't have the same output.

The config file works! But it's not obvious to me why. Let me know if you have any further insight.
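
One possible explanation, which this thread does not confirm: the config.cfg above uses spacy.TextCatBOW.v2, whose model block has no tok2vec sub-section, so it contains no ${components.tok2vec...} reference that could fail to resolve. The sketch below is a hypothetical, standard-library-only check (find_unresolved is not a spaCy or Prodigy function) that lists ${...} references whose target section or option is missing from a config string. It only approximates spaCy's dotted-reference rules, and the sample config is a cut-down illustration rather than the config Prodigy generated.

```python
import configparser
import re

# Matches spaCy-style variables such as ${paths.train}.
REFERENCE = re.compile(r"\$\{([^}]+)\}")

# Cut-down sample: two resolvable references and one dangling reference.
SAMPLE = """
[paths]
train = null

[system]
seed = 0

[corpora.train]
path = ${paths.train}

[training]
seed = ${system.seed}

[components.textcat_multilabel.model.tok2vec]
width = ${components.tok2vec.model.encode.width}
"""


def find_unresolved(config_text):
    """Return (section, option, reference) for every reference with no target."""
    # Interpolation is disabled so the raw ${...} strings can be inspected.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep option names case-sensitive (nO, L2, ...)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for option, value in parser.items(section):
            for ref in REFERENCE.findall(value):
                if parser.has_section(ref):
                    continue  # reference to a whole section
                target_section, _, target_option = ref.rpartition(".")
                if not parser.has_option(target_section, target_option):
                    missing.append((section, option, ref))
    return missing


if __name__ == "__main__":
    for section, option, ref in find_unresolved(SAMPLE):
        print(f"[{section}] {option} -> missing {ref}")
```

Run against the text of the config.cfg posted above, a check like this should come back empty, since every variable there points at [paths] or [system], both of which are defined in the same file.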