I am trying to establish whether it is better to train a text classifier on individual sentences or on the whole text. The scenario is processing messages (email messages, for example).
The email messages are sometimes short (2-3 sentences) and sometimes longer (3-5 paragraphs of 2-3 sentences each).
Let's say I'm trying to categorize the messages into something like SPORT, POLITICS, SHOPPING, PENPAL, etc. Many of the sentences in an email may not be specific to the category at all, for example sentences that just give the time and place of an event.
On Monday, Bob and Sally went to town. It was a beautiful afternoon. As they arrived, they noticed that there was going to be a game of football between the two regional teams. Sally never watched a game before so she was interested in going to it. Bob who isn't a particular fan, didn't want to but since Sally was so interested, he went for it.
During the game, they purchased some popcorn and drinks. At the end of the day, they ended up spending $500. But it was well spent, they really enjoyed the day.
In the example above, many of the sentences have nothing to do with sport. I'm afraid that if I train on the whole text and then evaluate in spaCy against the whole text, the noise around the subject will end up reducing the score too much. If I were to add a few more paragraphs about the two protagonists, but not about the sport event, the text would become less and less about sport, and a lower score would be a correct calculation: the text is about sport, but not so much.
But in my scenario, I would want this to be tagged as SPORT. And if other labels apply, they should be assigned as well.
So I think the best approach for my scenario is to work either by sentence or by paragraph: reduce the text to a specific context, train on the smaller text, which relates more strongly to a single subject, and analyze in spaCy by paragraph.
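A minimal sketch of the paragraph-level idea, assuming paragraphs are separated by blank lines and that per-paragraph scores are combined by keeping the maximum per label (so one clearly sport-related paragraph is enough to tag the message). `score_paragraph` and `KEYWORDS` are hypothetical keyword-based stand-ins so the sketch runs on its own; with a trained spaCy pipeline the scorer would be replaced by something like `nlp(paragraph).cats`.

```python
import re

# Hypothetical stand-in for a trained classifier. With spaCy this would be
# replaced by something like `nlp(paragraph).cats`.
KEYWORDS = {
    "SPORT": {"game", "football", "teams"},
    "SHOPPING": {"purchased", "spending"},
}

def split_paragraphs(text):
    """Split on one or more blank lines (an assumption about the input)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def score_paragraph(paragraph):
    """Toy scorer: 1.0 if any keyword for the label occurs, else 0.0."""
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    return {label: float(bool(words & kws)) for label, kws in KEYWORDS.items()}

def classify_message(text, threshold=0.5):
    """Score each paragraph, keep the max score per label, then threshold."""
    best = {label: 0.0 for label in KEYWORDS}
    for paragraph in split_paragraphs(text):
        for label, score in score_paragraph(paragraph).items():
            best[label] = max(best[label], score)
    return sorted(label for label, score in best.items() if score >= threshold)

message = (
    "On Monday, Bob and Sally went to town. There was going to be a game "
    "of football between the two regional teams.\n\n"
    "During the game, they purchased some popcorn and drinks."
)
labels = classify_message(message)
```

Taking the maximum means extra off-topic paragraphs cannot lower a label's score, which is the behaviour I want here. Averaging instead would bring back the dilution effect I am worried about.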
Of course, splitting by paragraph isn't always that easy, because people don't always format their text well. But I think the text I'm going to process will be split into paragraphs that are defined well enough for me to do so.
I am afraid that splitting at the sentence level gives the model less context and also produces many more negative examples: even if a paragraph is about a subject, maybe half of its sentences are still just providing context around the information being communicated.
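To make that concern concrete, here is a rough sketch that counts how many units would end up as off-topic (negative) examples at each granularity. The sentence splitter is a naive regex, and `is_on_topic` with `SPORT_WORDS` is a hypothetical keyword check standing in for a human annotation. Both are assumptions for illustration only, not how real labels would be produced.

```python
import re

# Hypothetical proxy for a human "this unit is about SPORT" annotation.
SPORT_WORDS = {"game", "football", "teams"}

def split_paragraphs(text):
    """Split on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_on_topic(unit):
    """True if the unit contains at least one sport keyword."""
    return bool(set(re.findall(r"[a-z]+", unit.lower())) & SPORT_WORDS)

def count_units(units):
    """Return (on_topic, off_topic) counts for a list of text units."""
    on_topic = sum(is_on_topic(u) for u in units)
    return on_topic, len(units) - on_topic

message = (
    "On Monday, Bob and Sally went to town. It was a beautiful afternoon. "
    "There was going to be a game of football between the two regional teams.\n\n"
    "During the game, they purchased some popcorn and drinks. "
    "They really enjoyed the day."
)
paragraph_counts = count_units(split_paragraphs(message))
sentence_counts = count_units(split_sentences(message))
```

On this shortened version of the example, both paragraphs count as on-topic, while three of the five sentences would be negatives.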
I hope what I wrote makes sense and that you have a tip for me. I'm doing some experiments on this, but input from others would be welcome.