Hierarchical text classification process

Hi there,

Been using spaCy for a while, but only now getting into text classification and using Prodigy (which is freaking awesome!).

I am training and annotating a model to analyse long texts, which I have broken into sentences. I have about 20 categories that these texts could fall into, and they're pretty much mutually exclusive.

After looking at the documentation, I narrowed the categories down to 8 higher-level categories and have implemented that in Prodigy.

This works fine for now, but it would be great to take these classified texts and further classify them as shown in the documentation here: https://prodi.gy/docs/text-classification#large-label-sets

I don't understand how to actually implement this. I looked at this thread: Heirarchical text classification - #2 by ines, which sort of answers my question, but I am still not clear.

Would the process be:

  1. Train a model for each sub-classification? I.e. I have 8 categories at the higher level, that's model 1. Then for each of the 8 I have another classification model (total = 9 models)?
  2. Add each of these as a pipeline component so that my pipeline may look something like:
    tokenizer -> ner -> textcat (high level) -> textcat 1.1 -> textcat 1.N?

How does #2 work in practice? Do I only run a sub-categorization pipeline on that subset of the data?
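In my head, #2 would work something like this, where the high-level classifier decides which sub-classifier a text even gets sent to. This is just a sketch with toy stand-ins (`predict_high`, `sub_classifiers` are my own placeholder names, not real spaCy/Prodigy APIs):

```python
# Hypothetical routing sketch: run the high-level classifier first,
# then only the matching sub-classifier on each text.
def classify(text, predict_high, sub_classifiers):
    # predict_high returns one of the 8 high-level labels
    high_label = predict_high(text)
    # sub_classifiers maps each high-level label to its own model
    predict_sub = sub_classifiers[high_label]
    return high_label, predict_sub(text)

# Toy stand-ins for trained models:
predict_high = lambda text: "SPORTS" if "match" in text else "POLITICS"
sub_classifiers = {
    "SPORTS": lambda text: "FOOTBALL",
    "POLITICS": lambda text: "ELECTIONS",
}

print(classify("The match ended 2-1.", predict_high, sub_classifiers))
# -> ('SPORTS', 'FOOTBALL')
```

So each sub-model would only ever see the subset of texts its parent category matched — is that roughly the idea?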

It feels a bit messy, but I'm sure I'm just muddled in my head; any help would be appreciated.

One thing to keep in mind is that the pipeline you use for the annotation doesn't have to be the same as the pipeline you train and evaluate for production.

Aside from annotation considerations, for neural models there's not really much advantage to having a hierarchical model. The model will already be predicting the labels from a dense representation, so you don't really need an objective that groups them.

The big advantage of the hierarchical approach is in the annotation, because it makes the interface much easier, and reduces your cognitive load. It's much easier to accurately make the same fine-grained decision all at once, instead of having to remember the details over rare examples.

Once you've actually got all the text annotated, you can make a dataset that's just one flat classification task, and then just train one text classification model. That way the model will be faster, and probably also a bit more accurate.
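For instance, flattening the hierarchical annotations can be as simple as joining the high-level and fine-grained labels into one string per example (the label names and record shape here are made up for illustration):

```python
# Sketch: combine the high-level and fine-grained annotations into a
# single flat label per example, e.g. "SPORTS/FOOTBALL".
hierarchical = [
    {"text": "The match ended 2-1.", "high": "SPORTS", "sub": "FOOTBALL"},
    {"text": "Parliament voted today.", "high": "POLITICS", "sub": "ELECTIONS"},
]

flat = [
    {"text": ex["text"], "label": f'{ex["high"]}/{ex["sub"]}'}
    for ex in hierarchical
]

print(flat[0]["label"])  # SPORTS/FOOTBALL
```

The resulting flat dataset trains a single text classifier over all the fine-grained labels, while the hierarchy only lives on in the label names.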


That totally just helped me make the mental model for that - thanks!