Nested hierarchy for textcat

How many categories are there in the hierarchy in total?

If your category scheme only has a few dozen categories in total, the approach that will probably take the least coding is to keep the classification problem flat, and then use your hierarchy at runtime to find the best-scoring leaf class.

Let’s say you have a hierarchy of 3 levels, each with three choices within them. So there are 27 leaf labels: 0.0.0 … 2.2.2. There are also 12 labels for the non-leaf categories: 0.*.*, 1.*.*, 2.*.*, 0.0.*, 0.1.*, 0.2.*, etc. We would then define the score of a leaf category as the product of the probabilities along its path. So the category 0.0.2 would score P(0.*.*) * P(0.0.*) * P(0.0.2). We would compute these path scores for each leaf to find the best-scoring leaf category.
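To make the arithmetic concrete, here’s a toy calculation with invented scores (the label names follow the dotted scheme above; the numbers are made up):

cats = {
    "0.*.*": 0.7, "0.0.*": 0.6, "0.1.*": 0.3,
    "0.0.2": 0.8, "0.1.1": 0.9,
}
score_002 = cats["0.*.*"] * cats["0.0.*"] * cats["0.0.2"]  # 0.7 * 0.6 * 0.8 = 0.336
score_011 = cats["0.*.*"] * cats["0.1.*"] * cats["0.1.1"]  # 0.7 * 0.3 * 0.9 = 0.189

Note that 0.1.1 has the higher raw leaf score, but 0.0.2 wins on the path product because its parent 0.0.* is more probable.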

This is the low-effort approach because spaCy’s text classifier doesn’t assume the classes are mutually exclusive, so you don’t really need to do anything on the Prodigy side. When you go to use the model, all you have to do is add a spaCy pipeline component that adjusts the doc.cats scores:

def adjust_textcat_path_scores(doc):
    # Logic here -- see the sketch below
    return doc

nlp.add_pipe(adjust_textcat_path_scores, after='textcat')
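Filling in that logic, here’s a minimal sketch. I’m assuming the dotted label scheme from above, with "*" wildcards marking the non-leaf labels; the helper is just illustrative, not a spaCy or Prodigy API:

def adjust_textcat_path_scores(doc):
    # Multiply each leaf label's score by the scores of its ancestors,
    # e.g. score(0.0.2) = P(0.*.*) * P(0.0.*) * P(0.0.2)
    leaves = [label for label in doc.cats if "*" not in label]
    for label in leaves:
        parts = label.split(".")
        for i in range(1, len(parts)):
            ancestor = ".".join(parts[:i] + ["*"] * (len(parts) - i))
            doc.cats[label] *= doc.cats.get(ancestor, 1.0)
    return doc

After this runs, the best-scoring leaf is simply the leaf label with the highest doc.cats score.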

I think there are plenty of more satisfying solutions, but I’m not sure which single approach to recommend. I suspect that if you asked three researchers, you might get three different answers.

The disadvantage of defining entirely different models is that they won’t get to share any information, which seems inefficient. It’s probably better if the same word embeddings, CNN etc. can be used for all of the models. You could have different output layers for the different levels and share the lower layers, as sketched below, though this might be a bit fiddly to implement. Unfortunately Thinc doesn’t currently have a hierarchical softmax function, or I would suggest that as another relatively simple alternative.
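To make the shared-layers idea concrete, here’s a rough sketch in PyTorch rather than Thinc (every name and size here is invented for illustration): one shared embedding-plus-CNN trunk, with a separate output head per level of the hierarchy, trained on the sum of the per-level losses so every level’s gradient flows into the shared layers.

import torch
import torch.nn as nn

class SharedTrunkTextcat(nn.Module):
    def __init__(self, vocab_size=10000, width=64, n_levels=3, n_choices=3):
        super().__init__()
        # Shared lower layers: embeddings plus a small CNN encoder
        self.embed = nn.Embedding(vocab_size, width)
        self.cnn = nn.Conv1d(width, width, kernel_size=3, padding=1)
        # One output head per level: 3, 9 and 27 classes respectively
        self.heads = nn.ModuleList(
            [nn.Linear(width, n_choices ** (i + 1)) for i in range(n_levels)]
        )

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)  # (batch, width, seq_len)
        x = torch.relu(self.cnn(x)).mean(dim=2)    # mean-pool to (batch, width)
        return [head(x) for head in self.heads]    # one logit vector per level

model = SharedTrunkTextcat()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 10000, (8, 20))  # fake batch of token ids
targets = [torch.randint(0, 3 ** (i + 1), (8,)) for i in range(3)]
loss = sum(loss_fn(logits, gold)
           for logits, gold in zip(model(tokens), targets))
loss.backward()  # gradients from all three levels reach the shared trunk

The fiddly part is mostly bookkeeping: keeping the per-level label sets aligned, and deciding whether the deeper heads should also condition on the shallower predictions.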