Text classification, model "forgets" about trained named entities after textcat.batch-train

Hi,

I used the following command to batch train a new text category:

prodigy textcat.batch-train class nl_tech --output nl_tech --n-iter 10 --eval-split 0.5 --dropout 0.4

The command adds the “textcat” folder to the model. But the “ner” folder is removed (or not copied) from the original model.

Please advise,

Bart Boerman

Hi! I think the reason for this is that the training recipe disables the entity recognizer. Because textcat.batch-train (and the other training commands) serialize out the best model on each epoch, all other components are disabled to make it run faster and consume less memory. You can change this by editing the following line in the recipe:

nlp = spacy.load(input_model, disable=['ner'])

Now that spaCy supports a stable mechanism for temporarily disabling pipes (nlp.disable_pipes), we should be able to rewrite the recipe so that all other pipes are disabled during training, are not serialized on each epoch, and are restored only at the very last step.
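To illustrate the idea (this is only a plain-Python stand-in, not the recipe code and not a call into spaCy), the "remove everything else, train, then restore" pattern could look roughly like this. The helper name only_pipe and the (name, component) list are hypothetical, made up for the sketch:

```python
from contextlib import contextmanager


@contextmanager
def only_pipe(pipeline, keep):
    """Temporarily strip every component except `keep` from `pipeline`.

    `pipeline` is a hypothetical list of (name, component) pairs that
    stands in for a real pipeline object in this sketch.
    """
    # Remember each removed component together with its original position.
    removed = [(i, item) for i, item in enumerate(pipeline) if item[0] != keep]
    pipeline[:] = [item for item in pipeline if item[0] == keep]
    try:
        yield pipeline
    finally:
        # Re-inserting in ascending index order restores the original order.
        for i, item in removed:
            pipeline.insert(i, item)


# Placeholder strings stand in for real components.
pipes = [("tagger", "T"), ("parser", "P"), ("ner", "N"), ("textcat", "C")]
with only_pipe(pipes, "textcat") as active:
    # Training and per-epoch saving would happen here, seeing only textcat.
    names_during_training = [name for name, _ in active]
# After the block, all four components are back in their original order.
```

The point of the try/finally is that the other components come back even if training stops with an error, so nothing gets lost from the saved model.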

In spaCy v2.0, you can train all components separately (and mix and match models), so you can also just drop the textcat weights into a model and add "textcat" to the pipeline in the meta JSON.
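If you'd rather script that than do it by hand, here is a minimal sketch using only the standard library. It assumes the model is a directory with one sub-folder per component (like the "ner" and "textcat" folders mentioned above) and a meta.json with a "pipeline" list. The function name add_component and its before argument are made up for this example; it works in either direction (copying "textcat" into a model that has "ner", or the other way round):

```python
import json
import shutil
from pathlib import Path


def add_component(source_model, target_model, name, before=None):
    """Copy the component folder `name` from source_model into target_model
    and list it in target_model/meta.json.

    Hypothetical helper: assumes one sub-folder per component and a
    "pipeline" list in meta.json.
    """
    source_model = Path(source_model)
    target_model = Path(target_model)

    # Copy the component data only if the target does not have it yet.
    dest = target_model / name
    if not dest.exists():
        shutil.copytree(str(source_model / name), str(dest))

    meta_path = target_model / "meta.json"
    with meta_path.open(encoding="utf8") as f:
        meta = json.load(f)

    pipeline = meta.setdefault("pipeline", [])
    if name not in pipeline:
        # Insert in front of `before` when it is listed, otherwise append.
        index = pipeline.index(before) if before in pipeline else len(pipeline)
        pipeline.insert(index, name)

    with meta_path.open("w", encoding="utf8") as f:
        json.dump(meta, f, indent=2)
    return pipeline
```

You would call it with your own paths, for example add_component("path/to/old_model", "path/to/new_model", "ner", before="textcat"). It's worth backing up the model directory first, since the sketch overwrites meta.json in place.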

Wow, thanks for the fast response -)! I copied “ner” from the old model to the new model and made the following change to the meta.json:

"pipeline": [
    "sbd",
    "tagger",
    "parser",
    "ner",
    "textcat"
],

… bingo … The model works for ner and classification. Thanks!!!

I have an off-topic question: would training word vectors (Word2vec) benefit both ner and classification?

Glad to hear it worked!

If you retrain the NER and textcat, you might be able to benefit from custom vectors. But if you don’t retrain the NER, you won’t be able to make use of them.

Thanks for your reply. I'm planning on retraining :grinning:

Just released v1.5.0, which should now temporarily disable all other pipes during training and restore them right before you save out the best model :tada: This means no components get lost and Prodigy also won’t have to serialize all other components on every epoch.

Cheers :boom:. I just installed the new version and will give it a spin.