Nested hierarchy for textcat

How many categories are there in the hierarchy in total?

If your category scheme only has a few dozen categories in total, the approach that will probably take the least coding is to keep the classification problem flat, and then use your hierarchy at runtime to find the best-scoring leaf class.

Let’s say you have a hierarchy of 3 levels, each with three choices within them. So there are 27 leaf labels: 0.0.0 … 2.2.2. There are also 12 labels for the non-leaf categories: 0.*.*, 1.*.*, 2.*.*, 0.0.*, 0.1.*, 0.2.*, etc. We would then define the score of a leaf category as the product of the probabilities along its path. So the category 0.0.2 would score P(0.*.*) * P(0.0.*) * P(0.0.2). We would compute these path scores for each leaf to find the best-scoring leaf category.
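To make the arithmetic concrete, here’s a toy calculation with invented scores (the label names follow the dotted scheme above; the numbers are made up):

cats = {
    "0.*.*": 0.7, "0.0.*": 0.6, "0.1.*": 0.3,
    "0.0.2": 0.8, "0.1.1": 0.9,
}
score_002 = cats["0.*.*"] * cats["0.0.*"] * cats["0.0.2"]  # 0.7 * 0.6 * 0.8 = 0.336
score_011 = cats["0.*.*"] * cats["0.1.*"] * cats["0.1.1"]  # 0.7 * 0.3 * 0.9 = 0.189

Note that 0.1.1 has the higher raw leaf score, but 0.0.2 wins on the path product because its parent 0.0.* is more probable.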

This is the low-effort approach because spaCy’s text classifier doesn’t assume the classes are mutually exclusive, so you don’t really need to do anything on the Prodigy side. When you go to use the model, all you have to do is add a spaCy pipeline component that adjusts the doc.cats scores:

def adjust_textcat_path_scores(doc):
    # Logic here -- see the sketch below
    return doc

nlp.add_pipe(adjust_textcat_path_scores, after='textcat')
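Filling in that logic, here’s a minimal sketch. I’m assuming the dotted label scheme from above, with "*" wildcards marking the non-leaf labels; the helper is just illustrative, not a spaCy or Prodigy API:

def adjust_textcat_path_scores(doc):
    # Multiply each leaf label's score by the scores of its ancestors,
    # e.g. score(0.0.2) = P(0.*.*) * P(0.0.*) * P(0.0.2)
    leaves = [label for label in doc.cats if "*" not in label]
    for label in leaves:
        parts = label.split(".")
        for i in range(1, len(parts)):
            ancestor = ".".join(parts[:i] + ["*"] * (len(parts) - i))
            doc.cats[label] *= doc.cats.get(ancestor, 1.0)
    return doc

After this runs, the best-scoring leaf is simply the leaf label with the highest doc.cats score.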

I think there are plenty of more satisfying solutions, but I’m not sure which single approach to recommend. I suspect that if you asked three researchers, you might get three different answers.

The disadvantage of defining entirely different models is that they won’t get to share any information, which seems inefficient. It’s probably better if the same word embeddings, CNN etc. can be used for all of the models. You could have different output layers for the different levels and share the lower layers, as sketched below, though this might be a bit fiddly to implement. Unfortunately Thinc doesn’t currently have a hierarchical softmax function, or I would suggest that as another relatively simple alternative.
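To make the shared-layers idea concrete, here’s a rough sketch in PyTorch rather than Thinc (every name and size here is invented for illustration): one shared embedding-plus-CNN trunk, with a separate output head per level of the hierarchy, trained on the sum of the per-level losses so every level’s gradient flows into the shared layers.

import torch
import torch.nn as nn

class SharedTrunkTextcat(nn.Module):
    def __init__(self, vocab_size=10000, width=64, n_levels=3, n_choices=3):
        super().__init__()
        # Shared lower layers: embeddings plus a small CNN encoder
        self.embed = nn.Embedding(vocab_size, width)
        self.cnn = nn.Conv1d(width, width, kernel_size=3, padding=1)
        # One output head per level: 3, 9 and 27 classes respectively
        self.heads = nn.ModuleList(
            [nn.Linear(width, n_choices ** (i + 1)) for i in range(n_levels)]
        )

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)  # (batch, width, seq_len)
        x = torch.relu(self.cnn(x)).mean(dim=2)    # mean-pool to (batch, width)
        return [head(x) for head in self.heads]    # one logit vector per level

model = SharedTrunkTextcat()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 10000, (8, 20))  # fake batch of token ids
targets = [torch.randint(0, 3 ** (i + 1), (8,)) for i in range(3)]
loss = sum(loss_fn(logits, gold)
           for logits, gold in zip(model(tokens), targets))
loss.backward()  # gradients from all three levels reach the shared trunk

The fiddly part is mostly bookkeeping: keeping the per-level label sets aligned, and deciding whether the deeper heads should also condition on the shallower predictions.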