Custom TextClassifier model for sequences

The built-in TextClassifier has worked well for building a prototype that predicts a harassment label. My dataset is nearing 5,000 items, my model's accuracy on the eval set is between ~75% and ~79%, and my train-curve results remain positive.

While my dataset is still small, I'd like to experiment with other types of models to try to better predict the right labels for conversational exchanges.

In particular, I would like to experiment with an LSTM setup to process threaded conversations as sequences of messages. The idea is that the context is essential when classifying harassment:

jaaames: @beardsly OMG, You bitch!
beardsly: back off :fu:

vs.

beardsly: Guess who's getting married? Me!
jaaames: @beardsly OMG, You bitch!
beardsly: I know, right!?

Does an LSTM seem like a reasonable setup for this type of input? And are there any examples of this kind of model working with Prodigy's active learning?

Thanks in advance for your thoughts.

An LSTM might work there. You might find the model defined here interesting: https://github.com/explosion/thinc/blob/master/examples/imdb_cnn.py

This creates a vector for each sentence using a CNN, parametric attention, and sum pooling. The sentence vectors are then combined into a document vector, again using parametric attention and sum pooling. You could insert a BiLSTM layer (or another CNN block) after the foreach(sent2vec) layer. This is basically the Hierarchical Attention Networks model of Yang et al. (2016), summarised here: https://explosion.ai/blog/deep-learning-formula-nlp

I added a comment showing where to insert the extra (BiLSTM or CNN) block. You can import the BiLSTM with from thinc.t2t import BiLSTM. Use the BiLSTM implementation in Thinc with care, though: it's not very well tested, so I'm not sure it's fully correct. It's also probably fairly inefficient compared to other libraries, although I'm not sure.
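To make the insertion point concrete, here's a rough structural sketch of the document-level chain. It assumes the foreach helper and the per-sentence sent2vec block from imdb_cnn.py (sent2vec below is just a trivial stand-in for the real CNN/attention/pooling block), and the widths and class count are placeholders. I haven't tested this, so treat it as a sketch rather than working code:

from thinc.api import chain, foreach
from thinc.t2t import BiLSTM, ParametricAttention
from thinc.t2v import Pooling, sum_pool
from thinc.v2v import Softmax

width = 64
nr_class = 2

# Trivial stand-in for the per-sentence CNN + attention + pooling
# block defined in imdb_cnn.py.
sent2vec = chain(ParametricAttention(width), Pooling(sum_pool))

model = chain(
    foreach(sent2vec),               # one vector per sentence
    BiLSTM(width, width),            # new: context across the conversation
    ParametricAttention(width * 2),  # the BiLSTM doubles the width
    Pooling(sum_pool),
    Softmax(nr_class, width * 2),
)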

Awesome, I'll poke at it for a while and then try to apply it to my existing dataset.

I appreciate the pointers, thanks!

So I’ve managed to get the cnn.py file to train on my existing dataset, and I have two questions:

  1. Do you have any recommendations on how to debug a Thinc model? I'm getting shape errors when trying to insert the BiLSTM, and I don't have an intuition for what the data and shapes look like at the various steps in the model. Is there maybe a clever debug layer you can insert at various places, or something similar?

  2. Is there supposed to be an easy way to extend the Prodigy TextClassifier with a custom thinc.neural.Model? I've mostly reproduced what I guess is the implementation of the Prodigy TextClassifier class, with the model from cnn.py mixed in, but it feels like I'm writing a lot of unnecessary code (e.g. to print the training stats, drive the training loop, etc.).

do you have any recommendations on how to debug a thinc model?

This is the fault of the BiLSTM layer, which as I said is a bit unloved; most of the other layers have nice shape checks :(. The BiLSTM concatenates its forward and backward states, so it outputs vectors of width 2*width, while the residual layers require the input width to equal the output width. The simplest solution is to use 2*width in the subsequent layers.
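If it helps, here's a minimal sketch of that fix. It assumes BiLSTM(nO, nI) follows the same constructor pattern as the other Thinc layers, and the Maxout projection and the widths are just placeholders:

from thinc.api import chain
from thinc.t2t import BiLSTM
from thinc.v2v import Maxout

width = 64

# The BiLSTM concatenates its forward and backward states, so it outputs
# 2 * width. Either size the subsequent layers for 2 * width, or project
# back down to width before any residual blocks:
encode = chain(
    BiLSTM(width, width),      # outputs 2 * width per position
    Maxout(width, width * 2),  # project 2 * width back down to width
)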

As a general tip for debugging: you can always wrap any function in thinc.api.layerize, like this:

from thinc.api import layerize

@layerize
def printer(inputs, drop=0.0):
    # Forward pass: print the activations, then pass them through unchanged.
    print(inputs)
    def print_gradient(d_inputs, sgd=None):
        # Backward pass: print the gradient, then pass it through unchanged.
        print(d_inputs)
        return d_inputs
    return inputs, print_gradient

This will give you a Thinc model you can insert anywhere, to spy on what's going on. I often insert these to monitor the mean and variance of the activations and gradients; doing this every 100 updates or so works well. You want to look for neurons with zero variance (i.e. they're always the same value), and you also want to check whether the means are increasing or staying stable.
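As a concrete sketch of that monitoring idea (assuming the layer sits somewhere the data is a plain array, and using a simple counter to only report every 100th update):

from thinc.api import layerize

state = {"updates": 0}

@layerize
def monitor(inputs, drop=0.0):
    # Forward pass: report summary statistics of the activations.
    state["updates"] += 1
    report = state["updates"] % 100 == 0
    if report:
        print("acts: mean=%.4f var=%.4f" % (inputs.mean(), inputs.var()))
    def monitor_gradient(d_inputs, sgd=None):
        # Backward pass: report summary statistics of the gradient.
        if report:
            print("grads: mean=%.4f var=%.4f" % (d_inputs.mean(), d_inputs.var()))
        return d_inputs
    return inputs, monitor_gradient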

Is there supposed to be an easy way to extend the prodigy TextClassifier with a custom thinc.neural.Model?

Either create a subclass that overrides the .Model() method (which creates the model), or pass the model in when you create the instance.

You can also assign to textcat.model if that's easier. After you create the textcat object, textcat.model should have the value True (that's the default set in __init__()). The class then calls .Model() during .begin_training(), .from_bytes() or .from_disk(), but only if textcat.model is still set to True.

Because the model is created late, rather than during __init__(), it's easy to assign a different model instead.
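To illustrate both approaches, here's a minimal sketch. The import path is the one Prodigy's built-in recipes use, and build_my_model() is a hypothetical stand-in for whatever builds your custom thinc.neural.Model:

from prodigy.models.textcat import TextClassifier

def build_my_model():
    # Hypothetical: return your custom thinc.neural.Model here,
    # e.g. the network adapted from cnn.py.
    ...

# Option 1: subclass and override .Model(). It's only called from
# .begin_training(), .from_bytes() or .from_disk(), and only while
# self.model is still True.
class CustomTextClassifier(TextClassifier):
    def Model(self, **kwargs):
        return build_my_model()

# Option 2: build the component as usual, then assign the model
# directly before training begins.
# textcat = TextClassifier(nlp, labels)
# textcat.model = build_my_model()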

A more general comment:

One of our next priorities is to finish writing wrappers for other libraries, so that you can use a PyTorch or Tensorflow model within spaCy and Prodigy. You'll also be able to plug a model from one of these libraries into a Thinc model, or even wire together networks from two libraries. Personally I'm very excited to try using XGBoost in some of my models. I think the academic community has been biased against it, especially in NLP.

We're very anxious to avoid a lock-in effect, where people would rather use a different machine learning library but feel they're stuck with ours. The last thing we want to do is spend our time replacing things people are already happy with!
