Text classification and whitespace

nlp("Follow-up x-rays recommended").cats
{'FOLLOWUP': 0.6010934710502625}

nlp("Follow-up x-rays recommended.\n").cats
{'FOLLOWUP': 0.4919336438179016}

nlp("Follow-up x-rays recommended.").cats
{'FOLLOWUP': 0.5012655258178711}

I am getting three different prediction probabilities just from varying the trailing "." and ".\n". Should these really matter?

It depends on how much training data you have, and whether the \n is in the training examples.

I’m surprised to see it so volatile, but the classification is based on a weighted sum of the input vectors, and then a non-linear prediction is made on that weighted sum. So having 1/7 of the tokens different could conceivably affect it.
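
To make that intuition concrete, here is a toy sketch, not the actual model architecture. The vectors, the weights, and the toy_score function are all invented for illustration, and it uses a plain average (a weighted sum with equal weights) followed by a sigmoid. It only shows why one or two extra tokens in a short text can move the score noticeably.

```python
import math

# Invented 2-d vectors for illustration only; a real model learns these from data.
TOY_VECTORS = {
    "follow-up": [0.8, 0.2],
    "x-rays": [0.6, 0.1],
    "recommended": [0.7, 0.3],
    ".": [-0.4, 0.5],
    "\n": [-0.6, -0.2],
}
TOY_WEIGHTS = [1.5, -0.5]  # hypothetical output-layer weights


def toy_score(tokens):
    """Average the token vectors, then squash a linear score with a sigmoid."""
    dims = len(TOY_WEIGHTS)
    pooled = [
        sum(TOY_VECTORS[t][d] for t in tokens) / len(tokens) for d in range(dims)
    ]
    logit = sum(w * x for w, x in zip(TOY_WEIGHTS, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


base = ["follow-up", "x-rays", "recommended"]
for tokens in (base, base + ["."], base + [".", "\n"]):
    print(tokens, round(toy_score(tokens), 3))
```

With so few tokens, each extra one shifts the pooled vector a lot, so the score changes. On longer texts the same extra token would have a much smaller effect.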

If you find these sorts of variations consistently matter, it should be easy to address with data augmentation: just randomly add some examples with different whitespace like this.
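
A minimal sketch of what that augmentation could look like, assuming your training data is a list of (text, cats) tuples. The augment_whitespace helper and the TRAILING_VARIANTS list are hypothetical names, not part of spaCy, and the set of variants is only a starting point.

```python
import random

# Hypothetical trailing variants to sample from; adjust to match your real data.
TRAILING_VARIANTS = ["", ".", ".\n", "\n", " "]


def augment_whitespace(examples, n_variants=2, seed=0):
    """Return the original examples plus copies with varied trailing
    punctuation and whitespace. Labels are copied unchanged."""
    rng = random.Random(seed)
    augmented = []
    for text, cats in examples:
        augmented.append((text, cats))
        # Strip any existing trailing whitespace and full stops first.
        base = text.rstrip().rstrip(".")
        candidates = [base + s for s in TRAILING_VARIANTS if base + s != text]
        k = min(n_variants, len(candidates))
        for variant in rng.sample(candidates, k):
            augmented.append((variant, dict(cats)))
    return augmented


examples = [("Follow-up x-rays recommended.", {"FOLLOWUP": 1.0})]
for text, cats in augment_whitespace(examples):
    print(repr(text), cats)
```

You would then train on the augmented list instead of the original one. Keeping the original example alongside its variants means the model still sees the text as it was annotated.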

If the issue comes up a lot, we should add some helpers with data augmentation functions. I could see that being useful.

My understanding was that "." and "\n", which are stop words or similar to stop words, are removed from the vector space, but maybe my assumption was not correct. Would it be a good idea to filter stop words or these characters from the vector space, since they don’t contribute to decision making but are rather used for segmenting sentences?

None of the models use stop lists. Stop lists only really make sense for linear models with unigram features when you know you want to classify along some dimension (e.g. topic) where some common words don’t have an impact. If you’re using a CNN, or BiLSTM, or longer phrases, the words removed by stop lists are often part of important phrases.

In some documents the \n character might be important context that changes the interpretation of the words around it. We want the model to be able to learn that, so we don’t remove it from the input.