Text classification, model "forgets" about trained named entities after textcat.batch-train

bboerman · May 31, 2018, 11:41am

Hi,

I used the following command to batch train a new text category:

prodigy textcat.batch-train class nl_tech --output nl_tech --n-iter 10 --eval-split 0.5 --dropout 0.4

The command adds the “textcat” folder to the model. But the “ner” folder is removed (or not copied) from the original model.

Please advise,

Bart Boerman

ines · May 31, 2018, 11:47am

Hi! I think the reason for this is that the training recipe disables the entity recognizer. Because textcat.batch-train (and the other training commands) serialize out the best model on each epoch, all other components are disabled to make it run faster and consume less memory. You can change this by editing the following line in the recipe:

nlp = spacy.load(input_model, disable=['ner'])

Now that spaCy supports a stable mechanism for temporarily disabling pipes (nlp.disable_pipes), we should be able to rewrite the recipe so that all other pipes are removed, not serialized during training and only restored at the very last step.
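To illustrate the disable-then-restore idea, here is a minimal sketch that uses a plain Python list as a stand-in for the pipeline. It is not spaCy's or Prodigy's implementation; the helper name `only_pipe` and the placeholder components are made up for the example.

```python
from contextlib import contextmanager


@contextmanager
def only_pipe(pipeline, keep):
    """Temporarily reduce `pipeline` (a list of (name, component) pairs)
    to the single pipe called `keep`, then restore the full list on exit.
    Hypothetical helper for illustration only."""
    original = list(pipeline)
    pipeline[:] = [(name, comp) for name, comp in original if name == keep]
    try:
        yield pipeline
    finally:
        # Restore every pipe in its original order, even if training fails.
        pipeline[:] = original


# Placeholder objects stand in for real pipeline components.
pipeline = [
    ("tagger", object()),
    ("parser", object()),
    ("ner", object()),
    ("textcat", object()),
]

with only_pipe(pipeline, "textcat") as active:
    # Training (and any per-epoch serialization) would happen here,
    # with only the text classifier in the pipeline.
    names_during = [name for name, _ in active]

names_after = [name for name, _ in pipeline]
```

With spaCy v2 itself, the equivalent would be along the lines of `with nlp.disable_pipes(*other_pipe_names):`, where `other_pipe_names` is whatever list of component names you want switched off.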

In spaCy v2.0, you can train all components separately (and mix and match models), so you can also just drop the textcat weights into a model and add "textcat" to the pipeline in the meta JSON.
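As a rough sketch of that manual mix-and-match step: the function below copies one component's folder from one model directory into another and registers it in the target's meta.json. The function name `add_component`, its `before` parameter and the directory names are assumptions made for this example, and it assumes each component lives in a folder named after it, as with the "ner" and "textcat" folders in this thread.

```python
import json
import shutil
import tempfile
from pathlib import Path


def add_component(src_model, dst_model, name, before=None):
    """Copy the `name` folder from src_model into dst_model and add `name`
    to the "pipeline" list in dst_model/meta.json.
    Hypothetical helper, not part of spaCy or Prodigy."""
    src_model, dst_model = Path(src_model), Path(dst_model)
    # Copy the component's weights folder unless it is already present.
    if not (dst_model / name).exists():
        shutil.copytree(src_model / name, dst_model / name)
    meta_path = dst_model / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    pipeline = meta.setdefault("pipeline", [])
    if name not in pipeline:
        # Insert before another component if requested, else append.
        if before in pipeline:
            pipeline.insert(pipeline.index(before), name)
        else:
            pipeline.append(name)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return pipeline


# Small demo on throwaway directories with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp) / "old_model", Path(tmp) / "new_model"
    (old / "ner").mkdir(parents=True)
    (old / "ner" / "model").write_bytes(b"placeholder weights")
    new.mkdir()
    (new / "meta.json").write_text(
        json.dumps({"pipeline": ["sbd", "tagger", "parser", "textcat"]}),
        encoding="utf8",
    )
    updated = add_component(old, new, "ner", before="textcat")
    copied = (new / "ner" / "model").exists()
```

It is worth backing up the target model directory first, since the function rewrites meta.json in place.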

bboerman · May 31, 2018, 12:03pm

Wow, thanks for the fast response! I copied “ner” from the old model to the new model and made the following change to the meta.json:

"pipeline": [
    "sbd",
    "tagger",
    "parser",
    "ner",
    "textcat"
],

… bingo … The model works for ner and classification. Thanks!!!

I have an off-topic question: would training word vectors (Word2vec) benefit both NER and classification?

honnibal · June 1, 2018, 1:58pm

Glad to hear it worked!

If you retrain the NER and textcat, you might be able to benefit from custom vectors. But if you don’t retrain the NER, you won’t be able to make use of them.

bboerman · June 1, 2018, 2:05pm

Thanks for your reply. I’m planning on retraining.

ines · June 7, 2018, 5:46pm

Just released v1.5.0, which should now temporarily disable all other pipes during training and restore them right before you save out the best model. This means no components get lost and Prodigy also won’t have to serialize all other components on every epoch.

bboerman · June 7, 2018, 9:31pm

Cheers. I just installed the new version and will give it a spin.
