Been using spaCy for a while, but only now getting into text classification and using Prodigy (which is freaking awesome!).
I am annotating data and training a model to analyse long texts, which I have broken into sentences. There are about 20 categories these texts could fall into, and they are pretty much mutually exclusive.
After looking at the documentation, I grouped the categories into 8 higher-level categories, and have implemented that in Prodigy.
This works fine for now, but it would be great to take these classified texts and further classify them as shown in the documentation here: https://prodi.gy/docs/text-classification#large-label-sets
I don't understand how to actually implement this. I looked at this thread: Heirarchical text classification - #2 by ines. It sort of answers my question, but I am still not clear.
Would the process be:
- Train a model for each sub-classification? I.e. I have 8 categories at the higher level, that's model 1. Then for each of the 8 I have another classification model (total = 9 models?)
- Add each of these as a pipeline component so that my pipeline may look something like:
tokenizer -> ner -> textcat (high level) -> textcat 1.1 -> textcat 1.N ?
How does the second step work in practice? Do I only run the sub-classification pipeline on the subset of data that falls into that higher-level category?
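To make that concrete, here is a minimal sketch of the routing I have in mind, not something taken from the docs. The label names and the `keyword_stub` helper are made up for illustration. I assume that in practice each classifier would be a trained spaCy pipeline wrapped as something like `lambda text: nlp(text).cats`.

```python
from typing import Callable, Dict, Optional, Tuple

# A "classifier" is anything that maps a text to label scores.
# Assumption: with spaCy this could be `lambda text: nlp(text).cats`.
Classifier = Callable[[str], Dict[str, float]]


def best_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def classify_hierarchically(
    text: str,
    top_level: Classifier,
    sub_models: Dict[str, Classifier],
) -> Tuple[str, Optional[str]]:
    """Run the top-level model, then only the sub-model for the winning label."""
    top = best_label(top_level(text))
    sub_model = sub_models.get(top)
    if sub_model is None:
        # No finer-grained model exists for this category.
        return top, None
    return top, best_label(sub_model(text))


def keyword_stub(keywords: Dict[str, str]) -> Classifier:
    """Hypothetical stand-in for a trained model: a label scores 1.0
    if its keyword occurs in the text, otherwise 0.0."""
    def predict(text: str) -> Dict[str, float]:
        lowered = text.lower()
        return {label: float(word in lowered) for label, word in keywords.items()}
    return predict


# Made-up labels: one top-level model plus one sub-model per top-level label.
top_level = keyword_stub({"BILLING": "invoice", "TECHNICAL": "crash"})
sub_models = {
    "BILLING": keyword_stub({"REFUND": "refund", "LATE_FEE": "late"}),
    "TECHNICAL": keyword_stub({"BUG": "error", "INSTALL": "install"}),
}

result = classify_hierarchically(
    "The invoice shows a late fee", top_level, sub_models
)
print(result)
```

With this layout each sentence only passes through two models (the top-level one and one sub-model), rather than through all 9 in a single pipeline. Is that the intended approach?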
It feels a bit messy, but I am sure I am just muddled in my head. Any help would be appreciated.