Sentence classification problem

alessioschiavelli · February 9, 2021, 4:46pm

Hi all!
I developed a service using the ‘prodigy train textcat’ recipe to predict which of 9 different classes a sentence belongs to.
My original dataset consists of 5,000 sentences, which I have expanded to 15,000 using data augmentation (with the nlpaug library).
The results seem to be good, at least satisfactory for this first version with a TPR ranging from 0.9 to 0.95, and a similar FPR, for all the 9 classes.
The problems arise when I try to distinguish between sentences that I know belong to one of these 9 classes and generic sentences that are not related in any way to the problem I’m studying.
I tried adding a 10th class, which I named ‘Alien class’, but its TPR and FPR are very low and not comparable with those of the other classes. Its AUC is somewhere between 60% and 70%, so, in brief, not far from random guessing.
Then I tried another approach: I created a binary classifier to use as a pre-filter, so that only the sentences related to the problem are passed on to the 9-class classifier.
Well, the results are unfortunately similar to the ‘Alien class’.
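Roughly, the two-stage setup can be sketched as below. This is only an illustration: the function names, the toy stand-in models and the 0.5 threshold are all assumptions made up for the example, not the actual service. With spaCy models, the two callables would typically wrap `nlp(text).cats`.

```python
from typing import Callable, Dict, Optional

# Hypothetical type aliases for illustration only.
RelevanceModel = Callable[[str], float]         # score for "related to the problem"
TopicModel = Callable[[str], Dict[str, float]]  # label -> score, like doc.cats


def classify_with_prefilter(
    text: str,
    relevance_model: RelevanceModel,
    topic_model: TopicModel,
    threshold: float = 0.5,  # assumed cut-off, should be tuned on held-out data
) -> Optional[str]:
    """Return the best class label, or None if the pre-filter rejects the text."""
    if relevance_model(text) < threshold:
        return None  # treated as an "alien" sentence, never reaches stage two
    scores = topic_model(text)
    return max(scores, key=scores.get)


# Toy stand-ins so the sketch runs without trained models.
def toy_relevance(text: str) -> float:
    keywords = ("refund", "order", "delivery")
    return 0.9 if any(word in text.lower() for word in keywords) else 0.1


def toy_topics(text: str) -> Dict[str, float]:
    if "refund" in text.lower():
        return {"BILLING": 0.8, "SHIPPING": 0.2}
    return {"BILLING": 0.2, "SHIPPING": 0.8}


in_domain = classify_with_prefilter("I want a refund for my order", toy_relevance, toy_topics)
out_of_domain = classify_with_prefilter("Nice weather today", toy_relevance, toy_topics)
print(in_domain, out_of_domain)
```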
Any suggestions? Is there another approach you would recommend?
Kind regards

ines · February 10, 2021, 12:57am

The binary pre-filter definitely sounds like the better approach, so I think it makes sense to focus on that. It also gives you more flexibility and lets you focus on this step (which clearly seems to be somewhat tricky) in isolation, use different training data if needed, etc. How are you sourcing your training examples for the alien class, and what's the distribution? For example, when you trained the binary classifier, how many alien examples vs. non-alien examples did you have?
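A quick way to check the distribution is to count the answers in the exported annotations. The snippet below is a minimal sketch that assumes binary textcat examples shaped like Prodigy's usual records, with `"text"`, `"label"` and `"answer"` keys; the `ALIEN` label name and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def label_distribution(examples: Iterable[Dict], label: str = "ALIEN") -> Counter:
    """Count accepted (alien) vs. rejected (non-alien) examples for one label."""
    counts = Counter()
    for eg in examples:
        if eg.get("label") != label:
            continue  # ignore annotations for other labels
        if eg.get("answer") == "accept":
            counts["alien"] += 1
        elif eg.get("answer") == "reject":
            counts["non_alien"] += 1
        # "ignore" answers are skipped on purpose
    return counts


# Made-up sample records, standing in for an exported dataset.
examples = [
    {"text": "Nice weather today", "label": "ALIEN", "answer": "accept"},
    {"text": "I want a refund", "label": "ALIEN", "answer": "reject"},
    {"text": "Where is my parcel?", "label": "ALIEN", "answer": "reject"},
    {"text": "asdf", "label": "ALIEN", "answer": "ignore"},
]

counts = label_distribution(examples)
print(counts["alien"], counts["non_alien"])
```

If one side heavily outnumbers the other, or the alien examples all come from a single narrow source, that could explain a weak pre-filter.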
