Text Classifier model architecture

Is there a place in the docs where the architecture of the text classifier is laid out? I looked at the spacy/_ml.py implementation, but it seems I first have to learn to read Thinc code before I can understand what’s going on…
thanks!

The exact details of the architecture are subject to change, so it’s currently not fully documented anywhere. The overall architecture is similar to the Yang et al (2016) hierarchical model described here: https://explosion.ai/blog/deep-learning-formula-nlp#entailment

However, there are some differences. I use a different embedding strategy, and I use CNNs instead of BiLSTMs. The embedding and CNN layers are the same as the NER and parsing models. The best description of this is currently my NER video: https://www.youtube.com/watch?v=sqDHBH9IjRU

Thinc code does look a little different from other neural network libraries. The basic premise is that all the layers and operations in a neural network are just functions, so we can build complicated networks out of function composition, just by defining little combinator functions that connect two layers together. Thinc then lets you temporarily bind these combinators to operators, to make model definition concise.

This line:

with Model.define_operators({'>>': chain, '+': add, '|': concatenate}):

binds the chain combinator to the >> operator. chain is just a function that takes two layers and pipes them together (handling the gradients appropriately), so it’s just a feed-forward relationship. | is bound to the concatenate function, which takes two layers and makes a layer whose output is the concatenation of the two input layers’ outputs.
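
To make that concrete, here’s a rough sketch of what chain and concatenate do if you treat layers as plain functions (this ignores gradients and Thinc’s actual Model class, so it’s only the conceptual shape of the combinators, not the real implementation):

import numpy

def chain(layer1, layer2):
    # Feed-forward composition: layer1's output becomes layer2's input.
    def forward(X):
        return layer2(layer1(X))
    return forward

def concatenate(layer1, layer2):
    # Run both layers on the same input and join their outputs on the last axis.
    def forward(X):
        return numpy.concatenate([layer1(X), layer2(X)], axis=-1)
    return forward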

This approach of just “scripting” the network with these higher-order functions means quite complicated architectures are pretty effortless to define. For instance, the embedding layer:


        lower = HashEmbed(width, nr_vector, column=1)
        prefix = HashEmbed(width//2, nr_vector, column=2)
        suffix = HashEmbed(width//2, nr_vector, column=3)
        shape = HashEmbed(width//2, nr_vector, column=4)

        vectors = (
            FeatureExtracter([ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID])
            >> with_flatten(
                uniqued(
                    (lower | prefix | suffix | shape)
                    >> LN(Maxout(width, width+(width//2)*3)),
                    column=0
                )
            )
        )

This creates a new model, vectors, which starts out by extracting features from the spaCy Doc objects, and then pipes the arrays forward into the actual embedding step. The embedding step is wrapped by two transforms: uniqued performs caching, so for a batch of words, we only have to compute the embedding once per word type. with_flatten concatenates list input into a single array, so that the layer inside can deal with contiguous input. The transformation is undone after the child layer is called, and the gradient is transformed appropriately as well.
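
As a rough illustration of the caching that uniqued provides (a conceptual numpy sketch, not Thinc’s implementation): for a batch of feature rows, the expensive embedding only has to be computed for the unique rows, and the results are then scattered back to the original token positions:

import numpy

def uniqued_apply(embed, keys):
    # keys: feature rows for each token, shape (n_tokens, n_columns)
    # embed: a function mapping unique rows to vectors, shape (n_unique, width)
    uniq_keys, inverse = numpy.unique(keys, axis=0, return_inverse=True)
    vectors = embed(uniq_keys)   # expensive step, done once per word type
    return vectors[inverse]      # scatter back out to every token position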

The actual embedding step happens in the lines:

 (lower | prefix | suffix | shape)
 >> LN(Maxout(width, width+(width//2)*3))

This part uses the concatenate combinator to make a vector that’s the concatenation of the embeddings for the lower-case ID, the prefix ID, the suffix ID, and the word shape ID. This long vector is then fed forward into a Maxout layer with layer normalization.
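
To make the shapes concrete: if width were, say, 64, the lower-case embedding would contribute 64 dimensions and each of the prefix, suffix and shape embeddings 32, so the concatenated vector has 64 + 3*32 = 160 dimensions, which is exactly the width + (width//2)*3 input size declared for the Maxout layer; the Maxout then maps it back down to width.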

After the vectors model has been constructed, it’s plugged into the rest of the network. We first apply two convolutional layers to make the vectors position-sensitive, and then we use Yang et al’s parametric attention layer to compute a weighted summary vector, which is fed forward into a multi-layer perceptron. Maxout units with layer-normalization and residual connections are used:

        cnn_model = (
            vectors
            >> with_flatten(
                LN(Maxout(width, vectors_width))
                >> Residual(
                    (ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
                ) ** 2, pad=2
            )
            >> flatten_add_lengths
            >> ParametricAttention(width)
            >> Pooling(sum_pool)
            >> Residual(zero_init(Maxout(width, width)))
            >> zero_init(Affine(nr_class, width, drop_factor=0.0))
        )

The output layer has one neuron per class label. No softmax is applied, because we’re not quite done yet. The CNN model is then stacked with a unigram bag of words:


        linear_model = (
            _preprocess_doc
            >> LinearModel(nr_class)
        )

        model = (
            (linear_model | cnn_model)
            >> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
            >> logistic
        )
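
The final logistic step applies an element-wise sigmoid to each class score, rather than a softmax, so the class probabilities are independent and a document can get more than one label. Roughly (a numpy sketch, not the actual Thinc layer):

import numpy

def logistic(scores):
    # scores: (batch_size, nr_class) raw outputs from the last Affine layer.
    # Element-wise sigmoid: each class gets its own independent probability,
    # so the outputs don't have to sum to 1 across classes.
    return 1.0 / (1.0 + numpy.exp(-scores))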

The unigram bag-of-words is somewhat helpful for Prodigy, especially if no pre-trained vectors are available. Without pre-trained vectors, the CNN model can take some time to get started, while the unigram bag-of-words starts learning much faster. The bag-of-words model is very cheap to compute, so even in cases where it doesn’t help, it also doesn’t hurt.

A similar but slightly different architecture can be found here: https://github.com/explosion/thinc/blob/master/examples/imdb_cnn.py . This one’s a bit more like Yang et al’s model, in that it has a per-sentence step, so it’s more hierarchical.

Finally, one unfortunate detail… The Affine layers in this code are misnamed :(. They do indeed use a bias. I got my terminology wrong when I was naming the class…


Thank you, this is great. I appreciate the detailed walk-through. Makes me want to look deeper into Thinc.

Hello Guys,

I’ve been using spaCy for a couple of months and I am trying to dive a little deeper into spaCy v2 and Thinc. I am currently doing document classification with the text classifier provided by spaCy and I am trying to understand the model behind the classifier. I was wondering if someone could walk me through the steps that build the word embeddings.

I get from this post, other ones, and the code I have been reading that the vectors model gathers the features of each doc using doc.to_array in the FeatureExtracter, and that all gets converted to a single vector. However, this is just the embedding step. Can someone shed more light on what is going on in the cnn_model?

I am also trying to map the names here and in the code to the concepts explained in the post Embed, encode, attend, predict. As far as I can understand, in this case the ID which goes into the embed step is represented by doc.to_array. I am not able to recognize the encoding step, where the sentence matrix is built. I see the ParametricAttention layer, but the attend step is also unclear to me.

Thanks in advance!

Hi @agh92,

If you haven’t seen it yet, this video about the named entity recognition model might be useful to you: https://www.youtube.com/watch?v=sqDHBH9IjRU

The text classification model in spaCy is really designed around Prodigy’s requirements, so it might not be the best for all situations. In particular, on some problems bag-of-words models with linear classification models perform really well. If this type of model works well on your data, it’s usually best to use that, instead of using a more complicated neural network approach. The implementations in Scikit-Learn and Vowpal Wabbit are both very good.
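
For instance, a quick bag-of-words baseline with scikit-learn might look like this (just a sketch; the texts and labels here are made-up placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["this product is great", "terrible service, never again"]
labels = ["POSITIVE", "NEGATIVE"]

baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["really great service"]))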

Anyway. To answer your question: the CNN model is the “encode” step here, because it updates each vector with information from the surrounding context. The NER video explains this a bit, but you should also be able to find plenty of other explanations around the web.
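
If it helps to see the idea in code, here’s a rough numpy sketch in the spirit of ExtractWindow(nW=1), not the actual implementation: each token’s vector is concatenated with its neighbours’ vectors, so whatever layer comes next sees the surrounding context as well as the token itself:

import numpy

def extract_window(X, nW=1):
    # X: token vectors, shape (n_tokens, width). Pad with zeros at both ends,
    # then concatenate each row with its nW neighbours on either side,
    # giving shape (n_tokens, width * (2*nW + 1)).
    pad = numpy.zeros((nW, X.shape[1]))
    padded = numpy.vstack([pad, X, pad])
    windows = [padded[i:i + X.shape[0]] for i in range(2 * nW + 1)]
    return numpy.hstack(windows)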

Thanks a lot! I watched the video and it helped.

Thanks for this nice explanation. When you mention the Maxout layer, is it related to the “Maxout Networks” paper by Goodfellow?

@d5555 Yeah, that’s where the idea was introduced. I think you can also see it as just maxpooling with 3 “filters”, but to be honest the computer vision presentation of CNNs has always confused me a bit, so I’m not 100% sure it’s the same. Anyway it’s just like:

def maxout(weights, bias, inputs):
    # Assume weights is shaped like (nr_in, nr_out, nr_piece)
    # Inputs will be shaped like (batch_size, nr_in)
    # Outputs will be shaped like (batch_size, nr_out)
    # e.g. if batch size 32, nr_in is 300, nr_out is 128, nr_piece is 3
    # Weights will be shaped (300, 128, 3)
    # Bias will be shaped (128, 3)
    # Inputs will be shaped (32, 300)
    # Outputs will be shaped (32, 128)
    nr_in, nr_out, nr_piece = weights.shape
    weights = weights.reshape((nr_in, nr_out * nr_piece))
    unpooled = (inputs @ weights).reshape((inputs.shape[0], nr_out, nr_piece))
    unpooled += bias
    outputs = unpooled.max(axis=-1)
    return outputs
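
For example, with random arrays (just to sanity-check the shapes):

import numpy

weights = numpy.random.randn(300, 128, 3)
bias = numpy.random.randn(128, 3)
inputs = numpy.random.randn(32, 300)
print(maxout(weights, bias, inputs).shape)  # (32, 128)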