Problem with multi-label text classifier

I have annotated three labels, each in a separate dataset, to identify the following topics in a text classification model: payment, termination and connection.

I batch-train each dataset into a separate model to make sure that the classification works and performs well. Then I merge all datasets into one and train a final model with three labels. This is where I run into the problem.

I tried some phrases against the final model, and phrases that get a high score on each separate model now end up with a low score in the final model.

It performs well on two of the labels but always gets a lower score on the third one. Which label gets the lower score varies. When I run the same phrase against the separate model for the label that got the lower score, it works just fine.

 Final model:

 >>> doc = nlp("Will I receive any more invoices for my connection now when its terminated?")
     {'connection': 0.9990618824958801, 'payment': 0.07060318440198898, 'termination': 

The same phrase against the separate payment model:

   >>> doc = nlp("Will I receive any more invoices for my connection now when its terminated?")
   {'payment': 0.9986937642097473}

Is there any explanation for the lower score?
When I run phrases that match only two labels, I get good results. It is when I enter a phrase that should match all three labels that I get problems.

Hmm. The model which has all three labels effectively has fewer parameters, because the three labels share weights all the way up to the output layer. So, one explanation could be that you need to use wider layers. Unfortunately it’s a bit hard to customize this in the current implementation.
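To make the parameter-sharing point concrete, here is a rough back-of-the-envelope sketch. It assumes a single linear output layer on top of a shared body; `BODY_PARAMS` is a made-up figure for illustration only, and the default TextCategorizer is more complex than this, so treat the numbers as indicative rather than exact.

```python
def output_layer_params(width, nr_class):
    """Weights plus biases of one linear output layer."""
    return width * nr_class + nr_class


def total_params(body_params, width, nr_class):
    """Shared body (embeddings + CNN) plus the output layer."""
    return body_params + output_layer_params(width, nr_class)


WIDTH = 128              # same default width as the custom model in this thread
BODY_PARAMS = 1_000_000  # hypothetical size of the shared body, for illustration

# Three single-label models: each one has its own body and a 1-class output.
separate = 3 * total_params(BODY_PARAMS, WIDTH, nr_class=1)

# One three-label model: a single body shared by all labels.
joint = total_params(BODY_PARAMS, WIDTH, nr_class=3)

print("three separate models:", separate)
print("one joint model:      ", joint)
```

Under these assumptions the joint model has roughly a third of the capacity of the three separate models combined, since only the output layer is label-specific.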

How many examples are in your dataset? And what sort of accuracies are you seeing, both in the single-label case, and when you train with all three labels?
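If it helps when reporting those numbers, a per-label accuracy can be computed along the lines of the sketch below. The function name, the 0.5 threshold and the example data are all assumptions for illustration; `predictions` is expected to look like the score dictionaries shown earlier in the thread.

```python
def per_label_accuracy(predictions, gold, threshold=0.5):
    """Fraction of examples on which each label is predicted correctly.

    predictions: list of {label: score} dicts
    gold: list of {label: bool} dicts, aligned with predictions
    """
    correct = {}
    total = {}
    for scores, labels in zip(predictions, gold):
        for label, is_true in labels.items():
            predicted = scores.get(label, 0.0) >= threshold
            total[label] = total.get(label, 0) + 1
            correct[label] = correct.get(label, 0) + int(predicted == is_true)
    return {label: correct[label] / total[label] for label in total}


# Made-up scores and gold labels, not taken from the real datasets.
example_predictions = [
    {"payment": 0.9, "connection": 0.8, "termination": 0.1},
    {"payment": 0.2, "connection": 0.7, "termination": 0.6},
]
example_gold = [
    {"payment": True, "connection": True, "termination": True},
    {"payment": False, "connection": True, "termination": True},
]

result = per_label_accuracy(example_predictions, example_gold)
print(result)
```

Breaking the accuracy down by label like this makes it easier to see whether one label is consistently the weak one.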

The following code shows how to customize the neural network model used in the TextCategorizer. It’s a bit simpler than the default model (no attention layer, no ensemble with a bag-of-words model), which might cut down some incidental problems.

from thinc.v2v import Model, Affine
from thinc.api import flatten_add_lengths, chain
from thinc.t2v import Pooling, mean_pool
from spacy._ml import logistic, zero_init, Tok2Vec
from spacy.pipeline import TextCategorizer
from spacy.language import Language

class CustomTextCategorizer(TextCategorizer):
    @classmethod
    def Model(cls, nr_class, **cfg):
        """Define a simpler CNN-based text classification model.
        Token representations are built based on a 5-word window,
        the representations are mean-pooled before a linear output layer.
        """
        width = cfg.get('width', 128)
        embed_size = cfg.get('embed_size', 10000)
        # CNN token-to-vector layer, shared by all labels
        tok2vec = Tok2Vec(width, embed_size)
        with Model.define_operators({'>>': chain}):
            model = (
                tok2vec
                >> flatten_add_lengths
                >> Pooling(mean_pool)
                >> zero_init(Affine(nr_class, width))
                >> logistic
            )
        # Keep a reference to the tok2vec layer on the model
        model.tok2vec = tok2vec
        return model

# You'll need to make this assignment before loading the model as well.
# Otherwise, when you try to load the pipeline, spaCy will initialize the default
# TextCategorizer class, and the model won't load.
Language.factories['textcat'] = CustomTextCategorizer

Thank you for your answer!
I have the following numbers of annotations in each dataset, and the total of them put together in the final one.

Do you see anything strange in the numbers of accepts/rejects below?

             payment   connection   termination
accepts          405          798           195
rejects          568          751           449
ignores         1157          992          4068
total           2128         2546          4710

The accuracy is quite high: each separate model reaches about 94%, and the final one reaches 90%.
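As a quick sanity check on the table above, the sketch below computes, per label, how many examples were actually answered (accepts plus rejects) and what share of those were accepts. The numbers are copied from the table; the dictionary layout and the `summarize` helper are just illustrative choices.

```python
counts = {
    "payment":     {"accepts": 405, "rejects": 568, "ignores": 1157, "total": 2128},
    "connection":  {"accepts": 798, "rejects": 751, "ignores": 992,  "total": 2546},
    "termination": {"accepts": 195, "rejects": 449, "ignores": 4068, "total": 4710},
}


def summarize(row):
    """Answered examples, accept rate among them, and the sum of the three rows."""
    answered = row["accepts"] + row["rejects"]
    return {
        "answered": answered,
        "accept_rate": row["accepts"] / answered,
        "row_sum": answered + row["ignores"],
    }


summary = {label: summarize(row) for label, row in counts.items()}

for label, s in summary.items():
    print(
        f"{label:12s} answered={s['answered']:5d} "
        f"accept_rate={s['accept_rate']:.2f} "
        f"row_sum={s['row_sum']} (table total: {counts[label]['total']})"
    )
```

On these figures, termination has the fewest answered examples (644) and the lowest accept rate (about 30%), which may or may not be related to the lower scores. The accepts, rejects and ignores also do not add up exactly to the listed totals, so the counts might be worth double-checking.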