textcat - by sentence or by whole document (3-5 paragraphs)


I am trying to establish whether it is better to train text classification on sentences or on the whole text. The scenario is processing messages (email messages, for example).

The email messages are sometimes short (2-3 sentences) and sometimes longer (3-5 paragraphs of 2-3 sentences each).

Let's say I'm trying to categorize the messages into something like SPORT, POLITICS, SHOPPING, PENPAL, etc. Many of the sentences in an email may not be specific to the category at all, for example, sentences giving the time and place of an event.


On Monday, Bob and Sally went to town. It was a beautiful afternoon. As they arrived, they noticed that there was going to be a game of football between the two regional teams. Sally had never watched a game before, so she was interested in going to it. Bob, who isn't a particular fan, didn't want to, but since Sally was so interested, he went along.

During the game, they purchased some popcorn and drinks. At the end of the day, they ended up spending $500. But it was well spent, they really enjoyed the day.

In the example above, many of the sentences have nothing to do with the sport subject. I'm afraid that if I train on the whole text and then evaluate in spaCy against the whole text, the noise around the subject will end up reducing the score too much. If I were to add a few more paragraphs about the two protagonists, but not about the sport event, the text would become less and less about sport, and a lower score would be a correct calculation: the text is about sport, but not very much.

But in my scenario, I would want this to be tagged as SPORT. And if there are other labels, they would be attributed also.

So I think the best approach for my scenario is to work by sentence or by paragraph: reduce the text to a specific context, train on the smaller text, which relates more strongly to a single subject, and then analyze in spaCy paragraph by paragraph.

Of course, splitting by paragraph isn't always that easy, because people don't always format their text well. But I think the text I'm going to process will be split into paragraphs that are defined well enough for me to do so.
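To make the idea concrete, here is a minimal sketch of the paragraph-level approach I have in mind. It assumes paragraphs are separated by blank lines and that a label should apply to the whole message if any single paragraph scores above a threshold. The names `split_paragraphs`, `classify_document`, `score_paragraph` and the `0.5` threshold are my own placeholders, not anything from spaCy or Prodigy.

```python
import re
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on one or more blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def classify_document(
    text: str,
    score_paragraph: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,  # assumed cut-off, tune on your own data
) -> Dict[str, float]:
    """Score each paragraph and keep the best score seen per label.

    Taking the maximum means one strongly on-topic paragraph is enough
    to tag the message, however much unrelated text surrounds it.
    """
    best: Dict[str, float] = {}
    for paragraph in split_paragraphs(text):
        for label, score in score_paragraph(paragraph).items():
            best[label] = max(best.get(label, 0.0), score)
    return {label: s for label, s in best.items() if s >= threshold}
```

With a trained spaCy pipeline that has a text classifier (assumed here to be loaded as `nlp`), the scorer could be something like `lambda p: nlp(p).cats`. Whether the maximum is the right way to combine paragraph scores is a judgment call; an average would bring back the dilution effect I'm worried about.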

I am afraid that splitting at the sentence level gives the model less contextual data and also produces far more negative examples: even if a paragraph is about a subject, maybe half of its sentences are still just providing context around the information being communicated.

I hope what I wrote makes sense and you have a tip for me. I'm doing some experiments on this, but input from others would be welcome.

I just found this thread - https://support.prodi.gy/t/text-classification-with-window - which is similar, and the comments at the end help in my case. I think training the text classifier on paragraphs instead of single sentences will give me a better result.

Ultimately you can try both, but I do think that paragraph-based approaches are good.

A lot of it comes down to annotation workflow. It's really not more expensive to make the annotations at the paragraph level rather than the document level, which means you may as well collect the extra information. Sentence-level annotations may or may not be more expensive than paragraph-level ones, but word-level annotations are certainly more expensive.

Text classification models are often surprisingly good at ignoring irrelevant text. However, it does depend on whether the classification is determined by keyword occurrences. If single-word features are dominant, the algorithm will ignore irrelevant text very well. But if you need more subtle cues built out of longer spans of text, the irrelevant text might hurt you more.
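As a toy illustration of the keyword-dominant case, consider a scorer that is just an unnormalized sum of per-word weights. The weights and the `sport_score` helper below are made up for the example, and real text classification models are considerably more complex, so treat this only as a sketch of why irrelevant text can have little effect when single-word features carry the decision.

```python
from typing import Dict

# Hypothetical per-word weights for a SPORT label; any word not listed
# contributes nothing to the score.
SPORT_WEIGHTS: Dict[str, float] = {"football": 2.0, "game": 1.0, "teams": 0.5}


def sport_score(text: str, weights: Dict[str, float] = SPORT_WEIGHTS) -> float:
    """Sum the weights of the words in the text (simple whitespace tokens)."""
    tokens = (tok.strip(".,!?") for tok in text.lower().split())
    return sum(weights.get(tok, 0.0) for tok in tokens)


short = "a game of football between the two regional teams"
padded = short + ". On Monday, Bob and Sally went to town. It was a beautiful afternoon."

# The extra sentences contain no weighted words, so the score is unchanged.
print(sport_score(short), sport_score(padded))
```

A model that instead relies on cues spread over longer spans has no such guarantee, which is where the surrounding text can start to hurt.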

Ah, very good insight here! Thanks.