Reject or skip examples for text classifier annotations

(Mikael Eriksson) #1

I'm going through 50k chat conversations in customer service. I'm using textcat.teach to set labels on them.
Example labels: payment, order info, cancellation.

There are a lot of messages inside the conversations that are not relevant at the moment, like welcome phrases and confirmations.

My question is whether I should reject them or simply skip them.
I started out with annotations for the payment label and got quite a good score, but after thousands of annotations, the majority of them rejects, the model gave a score close to zero for a relevant payment conversation.

Can you confirm that skip is a good solution at this point?


(Matthew Honnibal) #2

You might find that a two-stage pipeline will work better for your use-case here. If you have a really easy problem of rejecting 99% of your messages, it can be good to have a very simple model that performs that task as an initial filter. Then you can run your more powerful model on the remaining examples.

There are several good open-source solutions for simple text classification problems. The text classification solutions in scikit-learn are efficient and easy to use, so you might want to try that. The text classification solution in FastText is also pretty good.

Once you’ve filtered out the messages you can easily tell are irrelevant, you can work on just the relevant cases. This will make annotation faster, and should also improve your model accuracy, because the classes will be much more balanced.
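
As a rough illustration of the first stage, here is a minimal sketch using scikit-learn. The training examples, the "relevant"/"irrelevant" label names, and the helper functions are all made up for this example, and a bag-of-words Naive Bayes model is just one simple option, so treat it as a starting point rather than a recommended setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled examples: greetings and confirmations are
# "irrelevant", anything about payments, orders or cancellations is "relevant".
TRAIN = [
    ("hello welcome to customer service", "irrelevant"),
    ("thank you have a nice day", "irrelevant"),
    ("ok thanks", "irrelevant"),
    ("hi how can i help you", "irrelevant"),
    ("i want to cancel my order", "relevant"),
    ("my payment failed with my card", "relevant"),
    ("where is my order", "relevant"),
    ("i was charged twice for the payment", "relevant"),
]


def train_filter(examples):
    """Fit a simple bag-of-words Naive Bayes relevance filter."""
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


def keep_relevant(model, messages):
    """Return only the messages the filter predicts as relevant."""
    predictions = model.predict(messages)
    return [msg for msg, pred in zip(messages, predictions) if pred == "relevant"]


stage_one = train_filter(TRAIN)
incoming = ["hello and welcome", "my payment was charged twice"]
to_annotate = keep_relevant(stage_one, incoming)
print(to_annotate)
```

The messages left in `to_annotate` are the ones you would then annotate with textcat.teach for the payment, order info and cancellation labels. In practice you would want a few hundred labelled messages for the filter, and you should check what it throws away before trusting it.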


(Mikael Eriksson) #3

Thank you for your response, I'll try that :)