Unable to train textcat model using en_core_web_md as a base model

Hello -

I'm in the process of migrating some old textcat models to spaCy 3.4.1. For the retrained models, my goal is to roughly match the existing models' performance on the same training data.

The old model used word vectors from the spaCy medium model (en_core_web_md). However, when I try to train the new model with Prodigy (v1.11.8), I get the following error:

$ prodigy train --textcat-multilabel $MY_DATASET --base-model en_core_web_md
...
=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.textcat_multilabel.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
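
FWIW, the wording of that message matches the interpolation errors raised by Python's standard configparser module, so my guess (not confirmed) is that the generated textcat config points at a tok2vec setting that simply isn't defined once the tok2vec component comes from the base model. The sketch below reproduces the same class of error using only the standard library. It is not spaCy's or Prodigy's actual config loader: the section names, the `source` line, and the width of 96 are placeholders I made up for illustration, and the stdlib syntax is `${section:option}` rather than the dotted form in the real config.

```python
import configparser


def resolve_width(config_text):
    """Parse an INI-style config and return the interpolated textcat width."""
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    parser.read_string(config_text)
    # Interpolation is resolved lazily, when the value is read.
    return parser.get("textcat", "width")


# Hypothetical config in which the referenced option exists.
COMPLETE = """
[tok2vec]
width = 96

[textcat]
width = ${tok2vec:width}
"""

# Hypothetical config in which the tok2vec section has no 'width' option,
# so the reference from the textcat section cannot be resolved.
SOURCED = """
[tok2vec]
source = en_core_web_md

[textcat]
width = ${tok2vec:width}
"""

if __name__ == "__main__":
    print(resolve_width(COMPLETE))
    try:
        resolve_width(SOURCED)
    except configparser.InterpolationMissingOptionError as err:
        print(type(err).__name__, err)
```

If that guess is right, the fix would be to make sure the referenced width is actually defined in the config, or to replace the reference with a literal value, but I don't know how to do that through prodigy train.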

I see the same error when I try en_core_web_lg instead. FWIW, I've also tried the transformer model en_core_web_trf. Training seems to proceed in that case, but I currently don't have access to a GPU for training, and we don't run GPUs in production.

I'm aware of this issue, but I'm not quite sure if it applies here or how to apply the suggested workaround.

Any suggestions or insight? Many thanks in advance!
