Do stop words affect text classification in Prodigy?

Hey,

In general I would advise against stop-word removal, and I'd definitely advise against storing data that's been transformed (especially a lossy transformation like stop-word removal).

The history of stop-word removal goes like this. Early work on text classification was heavily influenced by information retrieval (search engines, basically). In early search engines, you started out by making a term-document matrix, a table where you could look up a term and get all the documents mentioning that term. This table was stored in a sparse format, so the function words like "the" and "of" would blow out the size of the table, while also being really useless (as they were in every document anyway). So you would remove those terms from the index.

Removing stop words can also be helpful in unigram (bag-of-words) text classification in some respects. If you know a priori that these features cannot be helpful for your task (for instance, if you're doing topic classification), it can be useful to put that domain insight into the model, to guide the solution that the learner comes to. It can also improve efficiency slightly, which used to be a bigger consideration. However, terms removed as "stop words" are often good features in unexpected ways. For instance, the term "I" is frequently a good feature for sentiment classification.
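As a quick illustration (just a sketch, assuming you have scikit-learn and spaCy installed; the exact contents of the lists depend on your versions), you can check whether terms like that sit on the standard English stop lists:

```python
# Quick check: do the default English stop lists in scikit-learn and spaCy
# include terms that tend to be useful sentiment features? Nothing is asserted
# here; it just prints what your installed versions actually contain.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en.stop_words import STOP_WORDS

for term in ("i", "not", "no", "never"):
    print(f"{term!r}: sklearn={term in ENGLISH_STOP_WORDS}, spacy={term in STOP_WORDS}")
```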

Removing stop words stops making sense once you want features that cover more than a single word at a time. You need function words to make any sense of linguistic structure, which is exactly what neural network models are designed to pick up on. Even if you just want to make a linear model with bigram or trigram features, you want to keep the stop words in.
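Here's a small sketch of why, using scikit-learn's `CountVectorizer` (nothing Prodigy-specific, just an illustration): with the built-in English stop list applied, the stop words are removed before the n-grams are built, so bigrams like "not good" never make it into the feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["The food was not good at all"]

# Bigrams built from the full token sequence: the negation survives as "not good".
keep = CountVectorizer(ngram_range=(1, 2))
print(sorted(keep.fit(text).get_feature_names_out()))

# With the built-in English stop list, words like "the", "was", "not", "at" and
# "all" are dropped *before* the n-grams are built, so the bigrams come from the
# filtered sequence and the negation is gone.
drop = CountVectorizer(ngram_range=(1, 2), stop_words="english")
print(sorted(drop.fit(text).get_feature_names_out()))
```

Because the bigrams are built from the filtered sequence, the stop-listed version can even hand you a feature like "food good" out of a clearly negative sentence, which is exactly the wrong signal for sentiment.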

Modern optimizers and regularisation techniques are better at ignoring irrelevant features anyway, so there's not even much advantage to removing stop words in unigram models. If you have to pick between a rule of thumb of "never remove stop words" and "always remove stop words", you'd definitely go for the former policy. There might be some situations where you'd consider removing them, but I'd say it's a pretty niche technique. It wouldn't be one of the top 150 things I teach in an NLP course.
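If you want to sanity-check that claim on your own data, a rough comparison like the following is usually enough. This is just a sketch with scikit-learn; 20 Newsgroups is a stand-in corpus, so swap in your own texts and labels, and I'm not promising any particular numbers.

```python
# Sketch: does dropping stop words actually help a regularized unigram model?
# 20 Newsgroups is used only as an example corpus; replace it with your own data.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["rec.autos", "sci.med"])

for stop_words in (None, "english"):
    model = make_pipeline(
        CountVectorizer(stop_words=stop_words),   # unigram bag of words
        LogisticRegression(max_iter=1000),        # L2-regularized by default
    )
    scores = cross_val_score(model, data.data, data.target, cv=5)
    print(f"stop_words={stop_words!r}: mean accuracy {scores.mean():.3f}")
```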