Does textcat lemmatize before/during training?

I have a text classification task that I want to perform. The actual data I run the classifier on will be in past tense (e.g. VBD tags). I have a potentially large source of training data that is pretty similar to the target data, but a lot of it is in present tense. I played around with converting the training text to past tense, but the results were sometimes ungrammatical. I guess the question is: does it matter? Will a model trained on present-tense data generalize to target data in past tense? Or would I have to lemmatize the data myself and use a classifier from sklearn?

Hi David,

The text categorizer doesn't currently use lemmatization, although it does perform a small amount of text normalization (true-casing, expanding certain abbreviations, etc.); it uses the NORM field of spaCy's Token object for this.
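You can inspect those normalized forms directly. Here's a minimal sketch (the exact normalizations you'll see depend on the language data your pipeline was built with):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I'm gonna train the textcat on these docs.")

# token.norm_ holds the normalized form, e.g. lowercased text
# with certain contractions expanded ("'m" -> "am").
for token in doc:
    print(f"{token.text:10} -> {token.norm_}")
```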

In my experience, lemmatization doesn't play a big role in neural network models. Sparsity from things like inflection is more important in bag-of-words models, but a neural network model with word vectors tends to model each token as a combination of features, so it's able to abstract away the inflection quite effectively. So I wouldn't worry about it, especially for English, where the inflection isn't very rich anyway.
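That said, if you do want to try the sklearn route you mentioned, lemmatizing with spaCy before a bag-of-words classifier is straightforward. This is just a sketch with hypothetical toy data; substitute your own corpus and labels:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")

def lemmatize(text):
    # Replace each token with its lemma, collapsing inflection
    # ("walked", "walks", "walking" -> "walk").
    return " ".join(token.lemma_ for token in nlp(text))

# Hypothetical toy data for illustration only.
train_texts = ["She walks to work", "He walked home", "They are walking"]
train_labels = ["commute", "commute", "leisure"]

clf = make_pipeline(
    CountVectorizer(preprocessor=lemmatize),
    LogisticRegression(),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["She walked to the office"]))
```

With the lemmas collapsed, past- and present-tense variants of the same verb map to the same bag-of-words feature, which is the sparsity problem lemmatization is meant to address.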
