textcat affects NER

I used textcat to train a new model based on en_core_web_lg.
The new model does text classification nicely, but its NER is really bad.

Is it possible that the NER model is affected by TextCategorizer?

Hmm, it shouldn’t be!

The NER and textcat models shouldn’t share any weights except the pre-trained vectors, but the pre-trained vectors should be static. So updates to the model shouldn’t be changing the NER. My guess is that the NER model is being updated during the textcat updates, even though that shouldn’t happen.
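
As a quick sanity check, you could compare the two pipelines and their entities side by side, something like this (the path to the new model is just a placeholder):

import spacy

# "/path/to/textcat_model" is a placeholder for wherever the trained model was saved.
nlp_orig = spacy.load("en_core_web_lg")
nlp_new = spacy.load("/path/to/textcat_model")

# If "ner" is missing from the new pipeline, the component was dropped entirely rather than degraded.
print("original pipeline:", nlp_orig.pipe_names)
print("new pipeline:     ", nlp_new.pipe_names)

text = "Apple is opening a new office in Berlin."
print("original entities:", [(ent.text, ent.label_) for ent in nlp_orig(text).ents])
print("new entities:     ", [(ent.text, ent.label_) for ent in nlp_new(text).ents])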

For sanity, could you try copying the textcat model directory into the directory of the original model? Assuming this works, it also gives you a workaround until we figure out what's wrong.
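
If it helps, here's a rough sketch of that copy in Python (all three paths are placeholders, so adjust them to your actual directories):

import shutil
from pathlib import Path

# Placeholder paths: the model produced by textcat training, the unpacked
# en_core_web_lg package directory, and a new directory for the merged model.
textcat_model = Path("/path/to/textcat_model")
base_model = Path("/path/to/en_core_web_lg")
merged_model = Path("/path/to/merged_model")

# Start from a copy of the original model so its NER weights stay intact,
# then drop the trained textcat weights in next to them.
shutil.copytree(base_model, merged_model)
shutil.copytree(textcat_model / "textcat", merged_model / "textcat")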

I just noticed that the ner directory is actually missing in my new model. 🤦
Sorry for the bad problem explanation...

Did you mean copying the textcat from the created model into the original en_core_web_lg?
I tried it, but the model doesn't recognize it.

Ah, this explains a lot – it looks like the textcat training process disabled the ner pipeline component, so when you save out the model, it's saved without the NER weights. Will have a look at this issue!

Edit: Just had a look and the most likely explanation is the following line in the textcat.teach recipe:

if input_model is not None:
    nlp = spacy.load(input_model, disable=['ner'])

I think we originally added this for efficiency, since the recipe always keeps a serialized copy of the best model, which can take a long time for large models. But maybe this is a bad default, since it's pretty unintuitive. You should be able to change this behaviour by removing disable=['ner']. (If you do so, keep us updated on the speed and performance!)
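
To illustrate the difference outside the recipe, here's a small sketch that uses en_core_web_lg to stand in for input_model:

import spacy

input_model = "en_core_web_lg"  # stands in for the model path passed to the recipe

# With the current default, the ner component is left out of the pipeline,
# so it never gets serialized with the best model:
nlp_without_ner = spacy.load(input_model, disable=['ner'])
print(nlp_without_ner.pipe_names)  # no 'ner' in this list

# Without the disable argument, all components stay in the pipeline and get saved:
nlp_full = spacy.load(input_model)
print(nlp_full.pipe_names)  # includes 'ner'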

Yes – but since the original model didn't have a text classifier, you'll also have to add "textcat" to the pipeline in the model's meta.json. For example:

{
    "pipeline": ["parser", "tagger", "ner", "textcat"]
}
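
If you'd rather script that edit than do it by hand, here's a minimal sketch (the model path is a placeholder for the directory you copied the textcat folder into):

import json
from pathlib import Path

# Placeholder path to the model directory that now contains the copied textcat folder.
meta_path = Path("/path/to/merged_model") / "meta.json"

meta = json.loads(meta_path.read_text())
if "textcat" not in meta["pipeline"]:
    meta["pipeline"].append("textcat")
meta_path.write_text(json.dumps(meta, indent=4))

After that, loading the directory with spacy.load should list all four components in nlp.pipe_names.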

Perfect, thanks a lot.

Will do!