Hierarchical text classification process

kayuzee · May 13, 2021, 12:25pm

Hi there,

Been using spaCy for a while, but only now getting into text classification and using Prodigy (which is freaking awesome!).

I am training and annotating a model to analyse long texts, which I have broken into sentences. I have about 20 categories that these texts could be - pretty much exclusive.

After looking at the documentation, I narrowed down the categories into 8 higher-level categories, and have implemented that in Prodigy.

This works fine for now, but it would be great to take these classified texts and further classify them as shown in the documentation here: Text Classification · Prodigy · An annotation tool for AI, Machine Learning & NLP

I don't understand how to actually implement this. I looked at this thread: Heirarchical text classification - #2 by ines. It sort of answers my question, but I am still not clear.

Would the process be:

1. Train a model for each sub-classification? I.e., I have 8 categories at the higher level, that's model 1. Then for each of the 8 I have another classification model (total = 9 models?)
2. Add each of these as a pipeline component, so that my pipeline may look something like:
   tokenizer -> ner -> textcat (high level) -> textcat 1.1 -> textcat 1.N?

How does #2 work in practice? Do I only run the sub-categorization pipeline on that subset of data?

It feels a bit messy, but I am sure I am just muddled in my head. Any help would be appreciated.

honnibal · May 15, 2021, 2:23pm

One thing to keep in mind is that the pipeline you use for the annotation doesn't have to be the same as the pipeline you train and evaluate for production.

Aside from annotation considerations, for neural models there's not really much advantage to having a hierarchical model. The model will already be predicting the labels from a dense representation, so you don't really need to have an objective that groups them.

The big advantage of the hierarchical approach is in the annotation, because it makes the interface much simpler and reduces your cognitive load. It's much easier to make one kind of fine-grained decision accurately across a batch of examples than to keep the details of every rare category in your head.

Once you've actually got all the text annotated, you can make a dataset that's just one flat classification task, and then just train one text classification model. That way the model will be faster, and probably also a bit more accurate.
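As a rough illustration of that flattening step, here is a minimal sketch in plain Python. It assumes the annotations were exported as dictionaries with "text", "accept" and "answer" keys, similar to choice-style annotations; the function names, the label names and the "PARENT_CHILD" naming scheme are all made up for the example. The output uses a "cats" dictionary mapping every flat label to 1.0 or 0.0, which is the shape spaCy's text classifier expects for exclusive categories.

```python
from typing import Dict, List, Tuple


def flatten_annotations(top_level: List[dict], sub_level: Dict[str, List[dict]]) -> List[dict]:
    """Merge top-level and sub-level decisions into one flat label per text.

    The record layout ("text", "accept", "answer") is an assumption for this sketch.
    """
    # Index the accepted sub-level label by (parent label, text) for quick lookup.
    child_index = {}
    for parent, examples in sub_level.items():
        for eg in examples:
            if eg.get("answer") == "accept" and eg.get("accept"):
                child_index[(parent, eg["text"])] = eg["accept"][0]

    flat = []
    for eg in top_level:
        # Skip rejected, ignored or unlabeled examples.
        if eg.get("answer") != "accept" or not eg.get("accept"):
            continue
        parent = eg["accept"][0]
        child = child_index.get((parent, eg["text"]))
        if child is None:
            # No fine-grained decision recorded yet for this text.
            continue
        flat.append({"text": eg["text"], "label": f"{parent}_{child}"})
    return flat


def to_textcat_data(flat: List[dict]) -> List[Tuple[str, dict]]:
    """Turn flat labels into (text, {"cats": {...}}) pairs for one exclusive textcat."""
    labels = sorted({eg["label"] for eg in flat})
    return [
        (eg["text"], {"cats": {label: float(label == eg["label"]) for label in labels}})
        for eg in flat
    ]


# Hypothetical annotations: one top-level pass, then one pass per top-level category.
top_level = [
    {"text": "Invoice is overdue.", "accept": ["FINANCE"], "answer": "accept"},
    {"text": "The contract ends in May.", "accept": ["LEGAL"], "answer": "accept"},
    {"text": "Hello there.", "accept": [], "answer": "ignore"},
]
sub_level = {
    "FINANCE": [{"text": "Invoice is overdue.", "accept": ["BILLING"], "answer": "accept"}],
    "LEGAL": [{"text": "The contract ends in May.", "accept": ["CONTRACTS"], "answer": "accept"}],
}

flat = flatten_annotations(top_level, sub_level)
train_data = to_textcat_data(flat)
```

The resulting pairs could then be converted into whatever format your training setup needs, and a single text classifier trained over all the flat labels, instead of chaining one classifier per level.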

kayuzee · May 17, 2021, 2:47pm

That totally just helped me make the mental model for that - thanks!
