Unable to train textcat model using en_core_web_md as a base model

macarlin · September 20, 2022, 3:43pm

Hello -

I'm in the process of migrating some old textcat models to be used with spaCy 3.4.1. For the updated/retrained model, my goal is to maintain approximate performance of the existing model on the same training data.

The old model used word vectors from the spaCy medium model (en_core_web_md). However, when I attempt to train the model via prodigy (v1.11.8), I see the following error:

$ prodigy train --textcat-multilabel $MY_DATASET --base-model en_core_web_md
...
=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.textcat_multilabel.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

I see the same error when I try en_core_web_lg. Also, FWIW, I've tried the transformer model en_core_web_trf, and while training seems to proceed, I don't currently have access to a GPU for training, nor are we running GPUs in production.

I'm aware of this issue, but I'm not quite sure if it applies here or how to apply the suggested workaround.

Any suggestions or insight? Many thanks in advance!

koaning · September 21, 2022, 12:33pm

Hi Michael,

I'll try to reproduce the steps here. Here's an examples.jsonl file.

{"text": "this text is very positive"}
{"text": "this text is very negative"}
{"text": "boo and hiss!"}
{"text": "yay and hurrah!"}
{"text": "i am neutral"}
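
For anyone who wants to recreate the file from a script, here is a minimal sketch using only the Python standard library. The helper names `write_jsonl` and `read_jsonl` are made up for illustration and are not part of Prodigy or spaCy; the texts are the five lines above.

```python
import json
from pathlib import Path

# The five example texts from this post.
EXAMPLES = [
    "this text is very positive",
    "this text is very negative",
    "boo and hiss!",
    "yay and hurrah!",
    "i am neutral",
]


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl("examples.jsonl", EXAMPLES)
    print(read_jsonl("examples.jsonl"))
```

Reading the file back is only a sanity check that every line parses as JSON before handing it to a recipe.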

These are annotated via:

python -m prodigy textcat.manual issue-5962 examples.jsonl --label pos,neg,neu

Next, when I run prodigy train on this data:

python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md

Then I see this intermediate output appear:

ℹ Using CPU

========================= Generating Prodigy config =========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
ℹ Auto-generating config with spaCy
config={'paths': {'train': None, 'dev': None, 'vectors': None, 'init_tok2vec': None}, 'system': {'gpu_allocator': None, 'seed': 0}, 'nlp': {'lang': 'en', 'pipeline': ['textcat_multilabel'], 'batch_size': 1000, 'disabled': [], 'before_creation': None, 'after_creation': None, 'after_pipeline_creation': None, 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}}, 'components': {'textcat_multilabel': {'factory': 'textcat_multilabel', 'model': {'@architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': False, 'ngram_size': 1, 'no_output_layer': False, 'nO': None}, 'scorer': {'@scorers': 'spacy.textcat_multilabel_scorer.v1'}, 'threshold': 0.5}}, 'corpora': {'dev': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.dev}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}, 'train': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.train}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}}, 'training': {'dev_corpus': 'corpora.dev', 'train_corpus': 'corpora.train', 'batcher': {'@batchers': 'spacy.batch_by_words.v1', 'discard_oversize': False, 'tolerance': 0.2, 'size': {'@schedules': 'compounding.v1', 'start': 100, 'stop': 1000, 'compound': 1.001, 't': 0.0}, 'get_length': None}, 'optimizer': {'@optimizers': 'Adam.v1', 'beta1': 0.9, 'beta2': 0.999, 'L2_is_weight_decay': True, 'L2': 0.01, 'grad_clip': 1.0, 'use_averages': False, 'eps': 1e-08, 'learn_rate': 0.001}, 'seed': '${system.seed}', 'gpu_allocator': '${system.gpu_allocator}', 'dropout': 0.1, 'accumulate_gradient': 1, 'patience': 1600, 'max_epochs': 0, 'max_steps': 20000, 'eval_frequency': 200, 'score_weights': {'cats_score': 1.0, 'cats_score_desc': None, 'cats_micro_p': None, 'cats_micro_r': None, 'cats_micro_f': None, 'cats_macro_p': None, 'cats_macro_r': None, 'cats_macro_f': None, 'cats_macro_auc': None, 'cats_f_per_type': None, 'cats_macro_auc_per_type': None}, 'frozen_components': [], 'annotating_components': [], 'before_to_disk': None, 'logger': {'@loggers': 
'spacy.ConsoleLogger.v1', 'progress_bar': False}}, 'pretraining': {}, 'initialize': {'vectors': '${paths.vectors}', 'init_tok2vec': '${paths.init_tok2vec}', 'vocab_data': None, 'lookups': None, 'tokenizer': {}, 'components': {}, 'before_init': None, 'after_init': None}}
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-21 14:28:21,704] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
[2022-09-21 14:28:21,717] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat_multilabel']
[2022-09-21 14:28:21,717] [INFO] Resuming training for: ['tok2vec']
[2022-09-21 14:28:21,721] [INFO] Created vocabulary
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-21 14:28:23,042] [INFO] Added vectors: en_core_web_md
[2022-09-21 14:28:23,137] [INFO] Finished initializing nlp object
[2022-09-21 14:28:23,140] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner', 'textcat_multilabel']
ℹ Frozen components: ['tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SPEED   SCORE 
---  ------  ------------  -------------  ----------  ------  ------
  0       0          0.00           0.25        0.00  385.30    0.00
1000    1000          0.00          58.68        0.00  545.31    0.00
2000    2000          0.00           8.41        0.00  578.05    0.00
3000    3000          0.00           3.04        0.00  560.96    0.00
4000    4000          0.00           1.43        0.00  571.13    0.00
5000    5000          0.00           0.76        0.00  561.90    0.00

You'll notice that I get some warnings about old md versions of the model, but besides that it seems to train fine. That means that I can't reproduce the error locally, but you'll also notice that it generates and prints a JSON blob with config settings. Could you share yours? I'm guessing something is going wrong there on your end.

Also, could you share your Prodigy and spaCy versions? Preferably via:

python -m pip freeze | grep spacy
python -m pip freeze | grep prodigy
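
If grep isn't available (on Windows, for example), roughly the same information can be read from inside the interpreter you are actually running. This is a sketch built on the standard library's `importlib.metadata`; the function name `installed_matching` is hypothetical. It matches on the package name only, so the model packages need their own search term.

```python
from importlib.metadata import distributions


def installed_matching(substring):
    """Return {name: version} for installed packages whose name contains substring."""
    # Treat "_" and "-" as the same character, as pip does when normalizing names.
    needle = substring.lower().replace("_", "-")
    found = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name and needle in name.lower().replace("_", "-"):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for key in ("spacy", "prodigy", "en-core-web"):
        for name, version in sorted(installed_matching(key).items()):
            print(f"{name}=={version}")
```

Running it with the same python that launches Prodigy also helps rule out a mismatch between virtual environments.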

Could you also verify whether the output stays the same when you run with python -m prodigy instead of prodigy? I just want to make sure it's not a virtualenv thing.

macarlin · September 21, 2022, 2:09pm

Hi Vincent -

Thanks for the reply!

So we're on the same page: I annotated the dataset you provided, saved it to the same issue-5962 dataset, and I see the same error when running your commands above. As for your questions:

I do not see the config JSON blob when trying to train, nor do I see similar warnings as you did. I tried adding the --verbose flag but that had no effect.

$ python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.textcat_multilabel.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

Prepending python -m yields the same output as the above.
My versions of all the things are as follows:

(prodigy-spacy3) nelson:~ michaelcarlin$ python -m pip freeze | grep spacy
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl
en-core-web-md @ https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.0/en_core_web_md-3.4.0-py3-none-any.whl
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl
spacy==3.4.1
spacy-alignments==0.8.5
spacy-legacy==3.0.10
spacy-loggers==1.0.3
spacy-transformers==1.1.8
(prodigy-spacy3) nelson:~ michaelcarlin$ python -m pip freeze | grep prodigy
prodigy==1.11.8
(prodigy-spacy3) nelson:~ michaelcarlin$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.4.1
Location         /Users/michaelcarlin/.pyenv/versions/prodigy-spacy3/lib/python3.8/site-packages/spacy
Platform         macOS-10.16-x86_64-i386-64bit
Python version   3.8.5
Pipelines        en_core_web_md (3.4.0), en_core_web_sm (3.4.0), en_core_web_trf (3.4.0), en_core_web_lg (3.4.0)

(prodigy-spacy3) nelson:~ michaelcarlin$ python -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.8
Location         /Users/michaelcarlin/.pyenv/versions/prodigy-spacy3/lib/python3.8/site-packages/prodigy
Prodigy Home     /Users/michaelcarlin/.prodigy
Platform         macOS-10.16-x86_64-i386-64bit
Python Version   3.8.5
Database Name    PostgreSQL
Database Id      postgresql
...

FWIW, I also tried the above in a brand new virtual environment, installing only prodigy, psycopg2 (so I can interface with the data), and downloading en_core_web_md. Still seeing the same errors.

Let me know if this yields any insight on your end. Thanks!

koaning · September 26, 2022, 8:38am

That's strange. I don't know why my output is different, but we can try another remedy in the meantime. We could manually pass a config.cfg file. I generated a partial config via the spaCy docs, and made it complete by running:

python -m spacy init fill-config ./partial.cfg ./config.cfg

This gave me this config.cfg file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Could you try to train with this config file? A command like this one should suffice:

python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md --config config.cfg

macarlin · September 26, 2022, 3:33pm

Hi Vincent -

Thanks! Yeah, also confused as to why we don't have the same output.

The config file works! But it's not obvious to me why. Let me know if you have any further insight.
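
One possible reading, offered as an assumption rather than something confirmed in this thread: the error message complains about a `${components.tok2vec.model.encode.width}` reference that points at nothing, while the config.cfg above uses spacy.TextCatBOW.v2 and contains no such reference. The toy checker below is not spaCy's or Prodigy's config code, and the `FAILING` dict is a hypothetical reduction built from the error message alone. It only illustrates how a `${...}` value fails when its target section is missing.

```python
import re

# Matches a value that consists entirely of one ${dotted.path} reference.
REF = re.compile(r"^\$\{([^}]+)\}$")


def lookup(config, dotted):
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted)
        node = node[part]
    return node


def unresolved_refs(config, root=None, prefix=""):
    """Return (option_path, missing_key) pairs for ${...} values that point nowhere."""
    root = config if root is None else root
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(unresolved_refs(value, root, path))
        elif isinstance(value, str):
            match = REF.match(value)
            if match:
                try:
                    lookup(root, match.group(1))
                except KeyError:
                    missing.append((path, match.group(1)))
    return missing


# Hypothetical reduction of the failing case: the reference exists,
# but there is no components.tok2vec section for it to resolve against.
FAILING = {
    "components": {
        "textcat_multilabel": {
            "model": {
                "tok2vec": {"width": "${components.tok2vec.model.encode.width}"}
            }
        }
    }
}

# Reduction of the hand-written config: every reference has a target.
WORKING = {
    "paths": {"train": None},
    "corpora": {"train": {"path": "${paths.train}"}},
    "components": {
        "textcat_multilabel": {"model": {"@architectures": "spacy.TextCatBOW.v2"}}
    },
}

if __name__ == "__main__":
    print(unresolved_refs(FAILING))
    print(unresolved_refs(WORKING))
```

Under that reading, the hand-written config would work simply because it never asks for the tok2vec width. Why the auto-generated config ended up with a dangling reference in the first place is not something this sketch can answer.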

claycwardell · December 13, 2022, 5:33am

I am getting the same error that Michael Carlin had when passing --base-model as an arg to prodigy train. I recreated @koaning's example and ran into the same problem Michael did. The problem was also reported again here: Trouble loading `en_core_web_lg` as base model - #2 by Jette16.

Would the dev team please look into what's actually happening instead of relying on this work-around?

ryanwesslen · December 13, 2022, 3:23pm

hi @claycwardell!

I've written an internal ticket and it will be prioritized accordingly. Thanks for your feedback!

macarlin · January 19, 2023, 1:51pm

@ryanwesslen In case it's helpful (and certainly seems related), I just observed a similar error when attempting to train a spancat model. I'm not blocked (starting with a blank:en model seems to work fine for the moment), but figured I'd share with the team:

$ prodigy train --spancat $MY_DATASET --base-model en_core_web_lg
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 5 to 45 (inferred from data)
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

Appreciate the continued support on this issue. Thanks guys!

koaning · February 3, 2023, 2:55pm

Just wanted to give a ping: we've started working on this issue. Will keep you posted!

joebuckle · February 14, 2023, 8:20am

Hello, any updates on this issue?

koaning · February 14, 2023, 9:28am

Definitely still working on it, and this ticket is part of the current sprint. It turns out to be a tricky one, but no worries, once it's fixed we'll let you all know.

koaning · May 2, 2023, 3:34pm

Hi all! Just wanted to report that we've made a patch release with a fix for this one. Let us know if issues persist! Details can be found in our changelog.
