Nested hierarchy for textcat

I am trying to train a text categorizer with a nested hierarchy, e.g. building > house > semi-detached house. Should I use 3 different models, one for each level of the hierarchy, or is it possible to combine the 3 levels? Also, how should I go about feeding the results from one level to the next? E.g. if at level 1 the predicted category is building (rather than, say, ship), then at the next level, the building category should have a higher weight on the choice of category?

If it's 3 different models, should I add them all to the same spaCy pipeline, or will they cause clashes as they are all under the `cats` namespace?

Also, if I want to pass the lemmatised version of the text to the classifier are there examples of how to do this?

How many categories in the total hierarchy?

If your category scheme has only a few dozen categories in total, probably the approach that will take the least coding is to keep the classification problem flat, and then use your hierarchy at runtime to find the best-scoring leaf class.

Let’s say you have a hierarchy of 3 levels, each with three choices within them. So there are 9 leaf labels: `0.0.0` … `2.2.2`. There are also 12 labels for the non-leaf categories: `0.*.*`, `1.*.*`, `2.*.*`, `0.0.*`, `0.1.*`, `0.2.*`, etc. We would then define the probability of assigning a category as the product of the probabilities along its path. So the category `0.0.2` would score P(0.0.2) · P(0.0.*) · P(0.*.*). We would compute these path probabilities for each leaf to find the best-scoring leaf category.

This is the low-effort approach because spaCy’s text classifier doesn’t assume the classes are mutually exclusive. So, you don’t really need to do anything on the Prodigy side to take this approach. When you go to use the model, all you have to do is add a spaCy pipeline component that adjusts the doc.cats scores:

def adjust_textcat_path_scores(doc):
    # Logic here

nlp.add_pipe(adjust_textcat_path_scores, after='textcat')
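For concreteness, here's one way that adjustment logic might look. This is a sketch I haven't run inside a real pipeline, and it assumes the flat model was trained with labels named exactly like the example above — `0.*.*` for level 1, `0.0.*` for level 2, `0.0.0` for a leaf — with the three-level scheme hard-coded:

```python
def adjust_textcat_path_scores(doc):
    """Rescore each leaf label as the product of probabilities along its path.

    Hypothetical sketch: assumes doc.cats holds independent scores for all
    labels, leaves named like "0.0.2" and ancestors like "0.0.*" / "0.*.*".
    """
    leaf_scores = {}
    for label, score in doc.cats.items():
        if "*" in label:
            continue  # only rescore leaf labels
        first, second, _ = label.split(".")
        level1 = first + ".*.*"
        level2 = first + "." + second + ".*"
        # P(leaf) * P(level-2 ancestor) * P(level-1 ancestor)
        leaf_scores[label] = (
            score * doc.cats.get(level2, 1.0) * doc.cats.get(level1, 1.0)
        )
    doc.cats.update(leaf_scores)
    return doc
```

The best-scoring leaf is then just the max over the leaf entries in `doc.cats`.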

I think there are lots of more satisfying solutions, but I’m not sure which single approach to recommend. I suspect if you asked three researchers you might get three different answers.

The disadvantage of defining entirely different models is that the models won’t get to share any information, which seems inefficient. It’s probably better if the same word embeddings, CNN, etc. can be used for all of the models. You could have different output layers for the different levels, and share the lower layers? This might be a bit fiddly to implement. Unfortunately Thinc doesn’t currently have a hierarchical softmax function, or I would suggest that as another relatively simple alternative.

Hmm, I have around 100 possible categories across 3 levels in my list (although some are probably a lot more common than others) – would that be too much?

So if I try to use different output layers for the 3 categories, would I be adding extra outputs to the default thinc classification model?

In any case I’ll give the flat classifier a try first, to see if it gets enough accuracy for my use case.

Thanks for the pointer :nerd_face:

A quick note that I should’ve mentioned in my last reply: of course it makes sense to exploit the category hierarchy as much as possible when annotating — e.g. annotate for the top of your hierarchy first, and then annotate within a node of the hierarchy once you have that label. But the way you annotate doesn’t have to match the way you run your classifier once you have a batch of annotations.

See how you go – possibly it’s a bit slow. I would always recommend getting your stuff wired up end-to-end before fiddling with things like the label hierarchy — you’ll at least get predictions if you flatten it out, and you can work on improving the accuracy once you have everything connected.

Once you’re tuning, you’ll probably want to export data from Prodigy and train other text classifiers, e.g. from scikit-learn. I think Vowpal Wabbit has support for hierarchical classification schemes, and it’s super fast. You can export the annotations with `prodigy db-out <dataset name>`. This will give you the data in JSONL format. scikit-learn in particular is really great for sanity-checking: you can train some simple bag-of-words models to get a baseline, and figure out whether something’s not right.
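As a sketch of what that sanity check might look like (not tested against real Prodigy output — the `"text"` and `"label"` field names are assumptions, so check what your export actually contains):

```python
import json

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_jsonl(path):
    """Read one JSON object per line, as produced by `prodigy db-out`."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

def train_baseline(texts, labels):
    """Unigram + bigram bag-of-words with logistic regression."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(),
    )
    model.fit(texts, labels)
    return model
```

If a baseline like this gets close to your neural model, that tells you something; if it scores near chance, the problem is probably in the data rather than the model.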

Yes. The good news is Thinc is designed to make this sort of thing pretty easy… The bad news is there’s no real documentation, and the API is unstable. You could also just use TensorFlow or PyTorch if you wanted to write a different model yourself.

Here’s a quick example of what things would look like in Thinc. The main thing to understand is that Thinc takes a very “functional programming” view of the problem of wiring neural networks together. A model is just a function that can return a callback to do the backward pass. Then we write other functions to compose these models.

Let’s say we want to have some model feed-forward into three separate output layers. The function to compose them would look like this:

# Note: Example code that I have not run.
from thinc.api import wrap

def multiplex(lower_layer, output_layers):
    '''Connect a lower layer to multiple outputs. The resulting layer outputs a tuple of values, and expects a tuple of gradients.'''
    def multiplex_forward(inputs, drop=0.):
        '''Perform the forward pass, and return a callback to complete the backward pass.'''
        hidden, get_d_inputs = lower_layer.begin_update(inputs, drop=drop)
        outputs = []
        get_d_hiddens = []
        for output_layer in output_layers:
            output, get_d_hidden = output_layer.begin_update(hidden, drop=drop)
            outputs.append(output)
            get_d_hiddens.append(get_d_hidden)
        def multiplex_backward(d_outputs, sgd=None):
            '''Callback to complete the backward pass. Expects the gradient w.r.t. the outputs,
            and a callable, 'sgd', which is the optimizer.'''
            d_hidden = get_d_hiddens[0](d_outputs[0], sgd=sgd)
            for d_output, get_d_hidden in zip(d_outputs[1:], get_d_hiddens[1:]):
                d_hidden += get_d_hidden(d_output, sgd=sgd)
            d_inputs = get_d_inputs(d_hidden, sgd=sgd)
            return d_inputs
        return outputs, multiplex_backward
    # Turns our function into a thinc.model.Model instance, and remembers its sublayers (for serialization etc)
    model = wrap(multiplex_forward, [lower_layer] + output_layers)
    return model

I haven’t run that, so it’s probably full of bugs — but it should be roughly what you would need to do. Logically, if you connect 3 output layers to some input layer, the gradients from those output layers get summed to compute the gradient to feed back to the input. (It might be tempting to weight that sum, if some output is less important than another. This can work, but equivalently you can just weight the loss function producing the gradients that are flowing down. This should give you the same thing, while being a bit cleaner and easier to describe.)
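To make the gradient-summing logic concrete without pulling in Thinc, here's a toy version of the same forward/backward pattern with plain NumPy "layers" that just multiply by a scalar (everything here is made up for illustration):

```python
import numpy as np

def make_scale_layer(w):
    """A toy 'layer' computing y = w * x; returns (output, backprop callback)."""
    def begin_update(x):
        def backprop(d_output):
            return w * d_output  # d_input = w * d_output for y = w * x
        return w * x, backprop
    return begin_update

lower = make_scale_layer(2.0)
output_layers = [make_scale_layer(3.0), make_scale_layer(5.0)]

# Forward: one lower layer feeding two outputs, as in multiplex above
x = np.array([1.0, 2.0])
hidden, bp_lower = lower(x)
results, bp_hiddens = zip(*(layer(hidden) for layer in output_layers))

# Backward: sum the gradients w.r.t. the hidden layer, then continue down
d_outputs = [np.ones_like(hidden), np.ones_like(hidden)]
d_hidden = sum(bp(d) for bp, d in zip(bp_hiddens, d_outputs))
d_x = bp_lower(d_hidden)  # 2.0 * (3.0 + 5.0) = 16.0 per element
```

Scaling either `d_outputs` entry by a weight before the sum is exactly the same as scaling the corresponding loss, which is the equivalence described above.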

Hey! I'm facing a similar problem but with a larger number of classes, ~1000 (possibly too large for the flat approach?). I've been reading about different strategies to incorporate information about the hierarchy. Just wanted to check whether the recommendations from this thread still hold, or whether they have maybe changed with newer spaCy versions. Thanks!

My recommendations would depend a lot on the type of hierarchy you have and I would try to exploit the hierarchy as much as possible. Is there anything you can share about the application? Will there be a human-in-the-loop? What are the consequences of getting it wrong?

Thanks for the quick reply! The hierarchy is a tree of academic subjects and the texts to be classified are class materials, lecture notes, readings, etc. The resulting class (or top classes) would be presented as suggestions, to help organize these texts so there would be a human in the loop. A few mistakes are not terrible but too many would make the suggestions useless.

Does it make sense to start with a smaller subset? Part of me is wondering if "does this document represent lecture notes?" is somewhat independent of the academic subject. Or am I wrong?

Oh, no need to guess the document type ("is it a lecture note?") just the academic subject ("is it a note about Science > Biology > Molecular Biology? ")

Part of me is thinking that it makes sense to first focus on the non-overlapping aspect of some of the labels. Unless there are articles that are both about Biology and Economics?

I'm also wondering, since this seems more like an in-depth spaCy question, if it's perhaps best to ask it on the spaCy discussion board. The spaCy maintainers keep an eye on that repository and may be able to give better advice.

Thanks! Will do. While searching for this subject I stumbled upon this thread, which seemed to hit just the spot. But absolutely, it's more of a spaCy-specific discussion. Thanks for your time!