textcat - by sentence or by whole document (3-5 paragraphs)

etlweather · November 24, 2019, 1:22am

Hello,

I am trying to establish whether it is better to train text classification on sentences or on the whole text. The scenario is processing messages (email messages, for example).

The email messages are sometimes short (2-3 sentences) and sometimes longer (3-5 paragraphs of 2-3 sentences each).

Let's say I'm trying to categorize the messages into something like SPORT, POLITICS, SHOPPING, PENPAL, etc. Many of the sentences in an email may not be specific to the category at all; for example, sentences giving the time and place of an event.

E.g.:

On Monday, Bob and Sally went to town. It was a beautiful afternoon. As they arrived, they noticed that there was going to be a game of football between the two regional teams. Sally had never watched a game before, so she was interested in going to it. Bob, who isn't a particular fan, didn't want to, but since Sally was so interested, he went for it.

During the game, they purchased some popcorn and drinks. At the end of the day, they ended up spending $500. But it was well spent, they really enjoyed the day.

In the example above, many of the sentences have nothing to do with sport. I'm afraid that if I train on the whole text and then evaluate in spaCy against the whole text, the noise around the subject will end up reducing the score too much. If I were to add a few more paragraphs about the two protagonists, but not about the sport event, the text would become less and less about sport, and a lower score would be a correct calculation: the text is about sport, but not very much.

But in my scenario, I would want this to be tagged as SPORT. If other labels apply, they should be assigned as well.

So I think the best approach for my scenario is to work by sentence or by paragraph: reduce the text to a specific context, train on the smaller text, which is more strongly tied to one subject, and analyze in spaCy by paragraph.

Of course, splitting by paragraph isn't always that easy because people don't always maintain a good form. But I think the text I'm going to process will be split in paragraphs defined well enough that I can do so.
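To make the paragraph-level idea concrete, here is a minimal sketch. It assumes paragraphs are separated by blank lines and that some per-paragraph scorer returns a dict of label scores (similar in shape to spaCy's `doc.cats`). The names `split_paragraphs`, `classify_message`, and `toy_scorer` are hypothetical, and the keyword scorer is only a stand-in for a trained model. Taking the maximum score per label across paragraphs is one possible aggregation, not the only one.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_paragraphs(text: str) -> List[str]:
    """Split on one or more blank lines and drop empty chunks."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def classify_message(
    text: str,
    score_paragraph: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,
) -> Tuple[List[str], Dict[str, float]]:
    """Score each paragraph separately and keep the best score per label.

    A label is assigned to the whole message if at least one paragraph
    scores at or above the threshold, so unrelated paragraphs cannot
    dilute a strong paragraph.
    """
    best: Dict[str, float] = {}
    for para in split_paragraphs(text):
        for label, score in score_paragraph(para).items():
            best[label] = max(score, best.get(label, 0.0))
    labels = sorted(label for label, score in best.items() if score >= threshold)
    return labels, best


def toy_scorer(paragraph: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained model: crude keyword matching."""
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    return {
        "SPORT": 0.9 if words & {"football", "game", "teams"} else 0.1,
        "SHOPPING": 0.9 if words & {"purchased", "spending"} else 0.1,
    }


message = (
    "On Monday, Bob and Sally went to town. There was going to be a game of football.\n"
    "\n"
    "They purchased some popcorn and drinks."
)
labels, scores = classify_message(message, toy_scorer)
print(labels, scores)
```

With a real model, `score_paragraph` could be something like `lambda p: nlp(p).cats`, assuming `nlp` is a loaded pipeline with a text classifier.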

I am afraid that splitting at the sentence level gives the model less context and also produces many more negative examples: even if a paragraph is about a subject, maybe half of its sentences are just providing context around the information being communicated.

I hope what I wrote makes sense and that you have a tip for me. I'm doing some experiments on this, but input from others would be welcome.

etlweather · November 24, 2019, 8:02pm

I just found this thread - https://support.prodi.gy/t/text-classification-with-window - which is similar, and the comments at the end help in my case. I think training the text classifier on paragraphs instead of single sentences will give me a better result.

honnibal · November 25, 2019, 11:55am

Ultimately you can try both, but I do think that paragraph-based approaches are good.

A lot of it comes down to annotation workflow. It's really not more expensive to make the annotations at the paragraph level rather than the document level, which means you may as well collect the extra information. Sentence level annotations may or may not be more expensive than paragraph level, but certainly word level annotations are more expensive.

Text classification models are often surprisingly good at ignoring irrelevant text. However, it does depend on whether the classification is determined by keyword occurrences. If single-word features are dominant, the algorithm will ignore irrelevant text very well. But if you need more subtle cues built out of longer spans of text, the irrelevant text might hurt you more.
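As a rough illustration of the keyword-dominant case, here is a toy bag-of-words scorer. It is not spaCy's textcat architecture: the weights, the bias, and the function name `sport_score` are all made up, and irrelevant words are assumed to carry exactly zero weight, which a trained model would only approximate.

```python
import math
import re

# Made-up weights for illustration; every other word has weight 0.
WEIGHTS = {"football": 2.5, "game": 1.5, "teams": 1.0}
BIAS = -2.0


def sport_score(text: str) -> float:
    """Logistic score over the set of distinct words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    z = BIAS + sum(WEIGHTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-z))


short = "There was a game of football between the two regional teams."
padded = short + " It was a beautiful afternoon. They really enjoyed the day."

# The padding adds only zero-weight words, so the score does not change.
print(sport_score(short) == sport_score(padded))
```

In this setup the extra sentences cannot move the score at all. A model that depends on cues spread across longer spans has no such guarantee, which is where irrelevant text is more likely to hurt.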

etlweather · November 25, 2019, 2:34pm

Ah, very good insight here! Thanks.
