I have annotated three labels, each with its own dataset, to identify the following topics in a text classification model: payment, termination and connection.
I batch train each dataset into a separate model to make sure the classification works and performs well. Then I merge all datasets into one and train a final model with the three labels. Here is where I face the problem.
I try some phrases against the final model, and phrases that resulted in a high score on each separate model now end up with a low score in the final model.
It performs well on two of the labels but always gets a lower score on the third one. It varies which label gets the lower score. When I run the same phrase against the separate model for the label that got the lower score, it works just fine.
Final model:
>>> doc = nlp("Will I receive any more invoices for my connection now when its terminated?")
>>> doc.cats
{'connection': 0.9990618824958801, 'payment': 0.07060318440198898, 'termination': 0.9835721850395203}
The same phrase against the separate payment model:
>>> doc = nlp("Will I receive any more invoices for my connection now when its terminated?")
>>> doc.cats
{'payment': 0.9986937642097473}
Is there any explanation for the lower score?
When I run phrases that should only match two labels I get good results. It is when I enter a phrase that should match all three labels that I get problems.
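For reference, this is roughly how I merge the datasets and train the final model. It's a simplified sketch (spaCy 2.x-style API; the texts, label values and epoch count here are just illustrative), not my exact script:

import random
import spacy

# Build a blank pipeline with one textcat component carrying all three labels.
nlp = spacy.blank('en')
textcat = nlp.create_pipe('textcat')
for label in ('payment', 'termination', 'connection'):
    textcat.add_label(label)
nlp.add_pipe(textcat, last=True)

# Merged data: one entry per annotated example, with a 'cats' dict over the labels.
train_data = [
    ('Can I get a copy of my latest invoice?',
     {'cats': {'payment': 1.0, 'termination': 0.0, 'connection': 0.0}}),
    # ... examples from the termination and connection datasets as well
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(epoch, losses)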
Hmm. The model that has all three labels effectively has fewer parameters per label, because the three labels share weights all the way up to the output layer. So, one explanation could be that you need to use wider layers. Unfortunately it’s a bit hard to customize this in the current implementation.
How many examples are in your dataset? And what sort of accuracies are you seeing, both in the single-label case, and when you train with all three labels?
The following code shows how to customize the neural network model used in the TextCategorizer. It’s a bit simpler than the default model (no attention layer, no ensemble with a bag-of-words model), which might cut down some incidental problems.
from thinc.v2v import Model, Affine
from thinc.api import flatten_add_lengths, chain
from thinc.t2v import Pooling, mean_pool
from spacy._ml import logistic, zero_init, Tok2Vec
from spacy.pipeline import TextCategorizer
from spacy.language import Language
class CustomTextCategorizer(TextCategorizer):
    @classmethod
    def Model(cls, nr_class, **cfg):
        """
        Define a simpler CNN-based text classification model.
        Token representations are built based on a 5-word window,
        and the representations are mean-pooled before a linear output layer.
        """
        width = cfg.get('width', 128)
        embed_size = cfg.get('embed_size', 10000)
        tok2vec = Tok2Vec(width, embed_size)
        with Model.define_operators({'>>': chain}):
            model = (
                tok2vec
                >> flatten_add_lengths
                >> Pooling(mean_pool)
                >> zero_init(Affine(nr_class, width))
                >> logistic
            )
        model.tok2vec = tok2vec
        return model
# You'll need to make this assignment before loading the model as well.
# Otherwise, when you try to load the pipeline, spaCy will initialize the default
# TextCategorizer class, and the model won't load.
Language.factories['textcat'] = CustomTextCategorizer
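Assuming the class above, wiring it up for training (rather than loading a saved pipeline) could look roughly like this. The width=256 value is just an example of trying a wider layer than the 128 default, and spacy.blank('en') stands in for however you create your pipeline:

import spacy

nlp = spacy.blank('en')

# Instantiate the custom component directly; the cfg is passed through to Model(),
# so width=256 requests a wider hidden layer than the default 128 above.
textcat = CustomTextCategorizer(nlp.vocab, width=256)
for label in ('payment', 'termination', 'connection'):
    textcat.add_label(label)
nlp.add_pipe(textcat, last=True)

# The network itself is built here, via CustomTextCategorizer.Model.
optimizer = nlp.begin_training()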