Text Categorization at Document level

textcat
best-practices

(David Gallagher) #1

Hi - I’ve got a bunch of support tickets that I want to categorize; specifically, I want the model to learn a label that would apply to the whole ticket / document. The tickets vary in length, and some run to multiple paragraphs of text. For annotation, should I still be trying to annotate at the level of individual sentences, or should I move up to paragraphs? Or to the whole document?


(Matthew Honnibal) #2

I think sentences or paragraphs are probably a good granularity to annotate at. If you’re going to be reading the text, you may as well click the button to apply an annotation at the sentence level — it doesn’t take any extra time, really, and it gives you more detailed annotations to learn from.
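To make the paragraph-level option concrete, here is a rough sketch of splitting each ticket into paragraph-sized annotation tasks before loading them. It assumes paragraphs are separated by blank lines and that tasks are written as JSONL with a "text" key plus a free-form "meta" dict; the `split_paragraphs` helper and the `ticket_id` / `paragraph` meta fields are made-up names for illustration, not part of any library.

```python
import json


def split_paragraphs(ticket_id, text):
    """Yield one annotation task per non-empty paragraph of a ticket.

    Assumes paragraphs are separated by blank lines. The meta fields
    ("ticket_id", "paragraph") are hypothetical names used so that each
    annotation can be traced back to its source document later.
    """
    for i, para in enumerate(text.split("\n\n")):
        para = " ".join(para.split())  # collapse internal whitespace
        if para:
            yield {"text": para, "meta": {"ticket_id": ticket_id, "paragraph": i}}


ticket = "My login fails every morning.\n\nAlso, the invoice for March is wrong."
tasks = list(split_paragraphs("T-1", ticket))

# Print one JSON object per line (JSONL), ready to save to a file.
for task in tasks:
    print(json.dumps(task))
```

Keeping the ticket ID in the meta means the paragraph-level decisions can be rolled back up to a document-level label afterwards if that turns out to be what you need.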


(Timothy J Laurent) #3

Honnibal, when you say “click … to apply an annotation at the sentence level”, do you mean using textcat.teach with the -L / --long-text classification mode?


(David Gallagher) #4

@timothyjlaurent, thanks for the additional comment! I hadn’t even noticed the long text classification mode. @honnibal - related (newbie) question. I have a set of 67 documents that have the classification I want to learn, but of course not every paragraph / sentence drives that classification. For training, can I use the same set of documents for both positive and negative examples, or should I plan to include another set of documents that I know don’t have the classification?

Thanks!