Text classification and whitespace

madhujahagirdar · February 14, 2018, 11:02am

nlp("Follow-up x-rays recommended").cats
{'FOLLOWUP': 0.6010934710502625}

nlp("Follow-up x-rays recommended.\n").cats
{'FOLLOWUP': 0.4919336438179016}

nlp("Follow-up x-rays recommended.").cats
{'FOLLOWUP': 0.5012655258178711}

I am getting 3 different prediction probabilities with just variations of "." and ".\n" at the end of the text. Should they really matter?

honnibal · February 14, 2018, 3:18pm

It depends on how much training data you have, and whether the \n is in the training examples.

I’m surprised to see it so volatile, but the classification is based on a weighted sum of the input vectors, and then a non-linear prediction is made on that weighted sum. So having 1/7 of the tokens different could conceivably affect it.
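To make the point about the weighted sum concrete, here is a toy sketch in plain Python. It is not spaCy's actual text classifier: the function toy_score, the two-dimensional vectors and the weights are all made up for illustration, and it assumes a simple weighted average followed by a sigmoid. It only shows how one extra token out of seven can move a score that sits close to 0.5.

```python
import math

def toy_score(token_vectors, token_weights, output_weights, bias=0.0):
    """Pool token vectors with a normalized weighted sum, then apply a sigmoid.

    Hypothetical stand-in for "weighted sum + non-linear prediction";
    not the architecture spaCy uses.
    """
    total = sum(token_weights)
    dim = len(output_weights)
    pooled = [
        sum(w * vec[i] for w, vec in zip(token_weights, token_vectors)) / total
        for i in range(dim)
    ]
    logit = sum(o * p for o, p in zip(output_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up numbers: six identical "word" vectors plus one dissimilar extra token.
output_weights = [1.0, -0.5]
tokens = [[0.2, 0.1] for _ in range(6)]
extra_token = [-1.0, 1.0]  # stands in for a trailing "\n" token

without_extra = toy_score(tokens, [1.0] * 6, output_weights)
with_extra = toy_score(tokens + [extra_token], [1.0] * 7, output_weights)
print(without_extra, with_extra)
```

In this made-up setup the extra token is enough to push the score from one side of 0.5 to the other, which is the kind of sensitivity described above when the model is not very confident to begin with.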

If you find these sorts of variations consistently matter, it should be easy to address with data augmentation: just randomly add some examples with different whitespace like this.

If the issue comes up a lot, we should add some helpers with data augmentation functions. I could see that being useful.
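A minimal sketch of what such an augmentation step could look like, in plain Python. The function name augment_endings, the (text, label) tuple format, and the list of endings are assumptions made for illustration, not an existing spaCy or Prodigy helper.

```python
import random

def augment_endings(examples, endings=("", ".", ".\n", "\n"), rate=0.3, seed=0):
    """Return the original examples plus copies with a different text ending.

    `examples` is assumed to be a list of (text, label) pairs.
    Each example is copied with probability `rate`; the copy has its trailing
    whitespace and periods replaced by a randomly chosen ending.
    """
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        if rng.random() < rate:
            # Strip trailing whitespace, then any trailing periods.
            base = text.rstrip().rstrip(".")
            augmented.append((base + rng.choice(endings), label))
    return augmented

# Hypothetical training data in (text, label) form.
train = [("Follow-up x-rays recommended.", {"FOLLOWUP": 1.0})]
train_augmented = augment_endings(train, rate=1.0)
```

The labels are copied unchanged, so the model sees the same annotation with and without the trailing period or newline and has a chance to learn that the ending does not matter for this category.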

madhujahagirdar · February 18, 2018, 2:36pm

My understanding was that "." and "\n", which are stop words or similar to stop words, are removed from the vector space, but maybe my assumption was not correct. Would it be a good idea to filter stop words or these characters from the vector space, since they don’t contribute to decision making but are rather used for segmenting sentences?

honnibal · February 18, 2018, 5:50pm

None of the models use stop lists. Stop lists only really make sense for linear models with unigram features when you know you want to classify along some dimension (e.g. topic) where some common words don’t have an impact. If you’re using a CNN, or BiLSTM, or longer phrases, the words removed by stop lists are often part of important phrases.

In some documents the \n character might be important context that changes the interpretation of the words around it. We want the model to be able to learn that, so we don’t remove it from the input.
