Nested hierarchy for textcat

I am trying to train a text categorizer with a nested hierarchy, e.g. building > house > semi-detached house. Should I use 3 different models, one for each level of the hierarchy, or is it possible to combine the 3 levels? Also, how should I go about feeding the results from one level to the next? E.g. if at level 1 the predicted category is building (rather than, say, ship), then at the next level the building category should have a higher weight on the choice of category?

If it’s 3 different models, should I add them all to the same spaCy pipeline, or will they clash because they all write to the same doc.cats namespace?

Also, if I want to pass the lemmatised version of the text to the classifier, are there examples of how to do this?

How many categories in the total hierarchy?

If your category scheme only has a few dozen categories in total, probably the approach that will take the least coding is to keep the classification problem flat, and then use your hierarchy at runtime to find the best-scoring leaf class.

Let’s say you have a hierarchy of 3 levels, each with three choices within them. So there are 27 leaf labels: 0.0.0 … 2.2.2. There are also 12 labels for the non-leaf categories: 0.*.*, 1.*.*, 2.*.* for the first level, and 0.0.*, 0.1.*, 0.2.*, etc. for the second. We would then define the probability of assigning a category as the product of the probabilities along its path. So the score for category 0.0.2 would be P(0.0.2) · P(0.0.*) · P(0.*.*). We would compute these path probabilities for each leaf to find the best-scoring leaf category.

This is the low-effort approach because spaCy’s text classifier doesn’t assume the classes are mutually exclusive. So, you don’t really need to do anything on the Prodigy side to take this approach. When you go to use the model, all you have to do is add a spaCy pipeline component that adjusts the doc.cats scores:

def adjust_textcat_path_scores(doc):
    # Logic here (see the sketch below)
    return doc

nlp.add_pipe(adjust_textcat_path_scores, after='textcat')
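
For concreteness, here’s a rough sketch of what that adjustment might look like. I haven’t run this either, and the path_ancestors helper and exact label names are just assumptions based on the 0.0.2-style scheme above:

# Example code, not run. Assumes doc.cats holds scores for both leaf
# labels like '0.0.2' and non-leaf labels like '0.*.*' and '0.0.*'.
def path_ancestors(label):
    '''Hypothetical helper: yield the non-leaf labels on a leaf's path.'''
    parts = label.split('.')
    for i in range(1, len(parts)):
        yield '.'.join(parts[:i] + ['*'] * (len(parts) - i))

def adjust_textcat_path_scores(doc):
    '''Rescore each leaf label as the product of the scores on its path.'''
    leaves = [label for label in doc.cats if '*' not in label]
    for leaf in leaves:
        score = doc.cats[leaf]
        for ancestor in path_ancestors(leaf):
            score *= doc.cats.get(ancestor, 1.0)
        doc.cats[leaf] = score
    return doc

You’d register this with nlp.add_pipe as above, and then take the best-scoring leaf from doc.cats at the end.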

I think there are lots of more satisfying solutions, but I’m not sure which single approach to recommend. I suspect if you asked three researchers you might get three different answers.

The disadvantage of defining entirely different models is that the models won’t get to share any information. This seems inefficient: it’s probably better if the same word embeddings, CNN etc. can be used for all of the models. You could have different output layers for the different levels, and share the lower layers? This might be a bit fiddly to implement. Unfortunately Thinc doesn’t currently have a hierarchical softmax function, or I would suggest that as another relatively simple alternative.

Hmm, I have around 100 possible categories across 3 levels in my list (although some are probably a lot more common than others) – would that be too many?

So if I try to use different output layers for the 3 levels, would I be adding extra outputs to the default Thinc classification model?

In any case I’ll give the flat classifier a try first, to see if it gets enough accuracy for my use case.

Thanks for the pointer :nerd_face:

A quick note that I should’ve mentioned in my last reply: of course it makes sense to exploit the category hierarchy as much as possible when annotating — e.g. annotate for the top of your hierarchy first, and then annotate within a node of the hierarchy once you have that label. But the way you annotate doesn’t have to match the way you run your classifier once you have a batch of annotations.

See how you go – possibly it’s a bit slow. I would always recommend getting your stuff wired up end-to-end before fiddling with things like the label hierarchy — you’ll at least get predictions if you flatten it out, and you can work on improving the accuracy once you have everything connected.

Once you’re tuning, you’ll probably want to export data from Prodigy and train other text classifiers, e.g. from Scikit-Learn. I think Vowpal Wabbit has support for hierarchical classification schemes, and it’s super fast. You can export the annotations with prodigy db-out <dataset name>. This will give you the data in JSONL format. Scikit-Learn in particular is really great for sanity-checking: you can train some simple bag-of-words models to get a baseline, and figure out whether something’s not right.
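
For example, a quick bag-of-words baseline could look something like this (untested; the 'text' and 'label' field names are assumptions, so check them against what db-out actually gives you):

# Example code, not run. Assumes each exported JSONL record has
# 'text' and 'label' keys -- adjust to match your actual export.
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts, labels = [], []
with open('annotations.jsonl') as file_:
    for line in file_:
        record = json.loads(line)
        texts.append(record['text'])
        labels.append(record['label'])

baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
print(cross_val_score(baseline, texts, labels, cv=5).mean())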

Yes. The good news is Thinc is designed to make this sort of thing pretty easy. The bad news is there’s no real documentation, and the API is unstable. You could also just use TensorFlow or PyTorch if you wanted to write a different model yourself.

Here’s a quick example of what things would look like in Thinc. The main thing to understand is that Thinc takes a very “functional programming” view of the problem of wiring neural networks together. A model is just a function that returns its output along with a callback to do the backward pass. Then we write other functions to compose these models.

Let’s say we want to have some model feed-forward into three separate output layers. The function to compose them would look like this:

# Note: Example code that I have not run.
from thinc.api import wrap

def multiplex(lower_layer, output_layers):
    '''Connect a lower layer to multiple outputs. The resulting layer outputs a list of values, and expects a matching list of gradients.'''
    def multiplex_forward(inputs, drop=0.):
        '''Perform the forward pass, and return a callback to complete the backward pass.'''
        hidden, get_d_inputs = lower_layer.begin_update(inputs, drop=drop)
        outputs = []
        get_d_hiddens = []
        for output_layer in output_layers:
            output, get_d_hidden = output_layer.begin_update(hidden, drop=drop)
            outputs.append(output)
            get_d_hiddens.append(get_d_hidden)
        def multiplex_backward(d_outputs, sgd=None):
            '''Callback to complete the backward pass. Expects the gradient w.r.t. the outputs,
            and a callable, 'sgd', which is the optimizer.'''
            d_hidden = get_d_hiddens[0](d_outputs[0], sgd=sgd)
            for d_output, get_d_hidden in zip(d_outputs[1:], get_d_hiddens[1:]):
                d_hidden += get_d_hidden(d_output, sgd=sgd)
            d_inputs = get_d_inputs(d_hidden, sgd=sgd)
            return d_inputs
        return outputs, multiplex_backward
    # Turns our function into a thinc.model.Model instance, and remembers its sublayers (for serialization etc)
    model = wrap(multiplex_forward, [lower_layer] + output_layers)
    return model

I haven’t run that, so it’s probably full of bugs — but it should be roughly what you would need to do. Logically, if you connect 3 output layers to some input layer, the gradients from those output layers get summed to compute the gradient to feed back to the input. (It might be tempting to weight that sum, if some output is less important than another. This can work, but equivalently you can just weight the loss function producing the gradients that are flowing down. This should give you the same thing, while being a bit cleaner and easier to describe.)
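
For a sense of how you’d use that function: wiring a shared hidden layer into one output head per hierarchy level might look roughly like this. Again untested, and the thinc.v2v layer signatures and sizes here are from memory, so treat them as assumptions:

# Example usage, not run. The sizes are made up: 300-dim input
# vectors, a 64-dim shared hidden layer, and one softmax head per
# level of the 3/9/27-label hierarchy sketched earlier.
from thinc.v2v import ReLu, Softmax

lower_layer = ReLu(64, 300)
output_layers = [Softmax(n, 64) for n in (3, 9, 27)]
model = multiplex(lower_layer, output_layers)

# model.begin_update(inputs) returns the list of three score arrays,
# plus the callback that sums the gradients flowing back into ReLu.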