Text Categorization at Document level

Hi - I’ve got a bunch of support tickets that I want to categorize; specifically, I want the model to learn a label that applies to the whole ticket / document. The tickets vary in length, and some of them run to multiple paragraphs of text. For annotation, should I still be trying to annotate at the level of individual sentences, or should I move up to paragraphs? To the whole document?

I think sentences or paragraphs are probably a good granularity to annotate at. If you’re going to be reading the text, you may as well click the button to apply an annotation at the sentence level — it doesn’t take any extra time, really, and it gives you more detailed annotations to learn from.
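In case it helps, here is a rough sketch of how you might pre-split tickets into sentence- or paragraph-sized tasks before annotating. It uses only the standard library and a deliberately naive regex for sentence boundaries (a real sentence segmenter would do better). The function names and the "ticket_id" / "index" meta keys are made up for illustration; the only thing assumed about the annotation tool is that it reads JSONL records with a "text" field and an optional "meta" field.

```python
import json
import re


def split_into_units(text, level="sentence"):
    """Split a ticket into paragraphs or, naively, into sentences."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if level == "paragraph":
        return paragraphs
    sentences = []
    for para in paragraphs:
        # Naive rule (an assumption): break after ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", para)
        sentences.extend(s.strip() for s in parts if s.strip())
    return sentences


def ticket_to_tasks(ticket_id, text, level="sentence"):
    """Turn one ticket into a list of annotation tasks, one per unit."""
    # "ticket_id" and "index" are hypothetical meta keys, kept so that
    # sentence-level decisions can be traced back to the source ticket.
    return [
        {"text": unit, "meta": {"ticket_id": ticket_id, "index": i}}
        for i, unit in enumerate(split_into_units(text, level))
    ]


if __name__ == "__main__":
    ticket = "My invoice is wrong. Please fix it!\n\nAlso, the app crashes on login."
    for task in ticket_to_tasks("T-1", ticket):
        print(json.dumps(task))  # one JSON object per line (JSONL)
```

Keeping the ticket ID in the meta means you can later roll the per-sentence answers back up to the ticket if you want a document-level label.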

Honnibal, when you say "click ... to apply an annotation at the sentence level", do you mean using the textcat.teach recipe with the -L/--long-text classification mode?

@timothyjlaurent, thanks for the additional comment! I hadn’t even noticed the long text classification mode. @honnibal - related (newbie) question. I have a set of 67 documents that have the classification I want to learn, but of course not every paragraph / sentence drives that classification. For training, can I use the same set of documents for both positive and negative examples, or should I plan to include another set of documents that I know don’t have the classification?

Thanks!