I trained a sentiment analysis model on a bunch of movie reviews using a slightly modified version of:
but I noticed that the model didn't work well on short sentences, which is no wonder, since the training data consisted of full-length articles. I was wondering if I could somehow feed Prodigy a list of negative and positive words and retrain the model so that it takes those words into account and, hopefully, works better on shorter sentences. Would it be a good or bad idea to use e.g. the
mark recipe and feed it words/phrases that are positive or negative on their own (without context), labelling them accordingly? And how does spaCy/Prodigy deal with negation?
Another thing I'm wondering about is whether there is any significant difference between adding word vectors via spaCy's
--vectors option vs. training the basic model without vectors and using Prodigy's
Is this a bad or good idea to use e.g.
the mark recipe and feed it words/phrases that are either positive or negative on their own (without context) and label them accordingly?
I think it could work, but you could also just take text you’ve labelled as positive or negative and split it into sentences, and train on those sentences.
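To make the sentence-splitting idea concrete, here's a minimal sketch. The example reviews and labels are made up, and the regex split is a deliberately naive stand-in — in practice you'd use spaCy's sentencizer (or a parser-based sentence splitter), but the shape of the resulting training examples is the same:

```python
import re

# Hypothetical labelled examples: full-length reviews with a sentiment label.
labelled_reviews = [
    ("The plot was thin. The acting saved it though.", "POSITIVE"),
    ("I expected more. The pacing dragged badly.", "NEGATIVE"),
]

def split_sentences(text):
    # Naive split on sentence-final punctuation; spaCy's sentencizer
    # is more robust for real data.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Each sentence inherits the label of the review it came from.
sentence_examples = [
    {"text": sentence, "label": label}
    for review, label in labelled_reviews
    for sentence in split_sentences(review)
]

for example in sentence_examples:
    print(example)
```

Each dict then becomes one short-text training example, so the model sees evidence at the same length it'll be asked to classify.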
The current model should actually do fairly okay at normalising for length. One of the things that makes short text hard is that there's just less evidence in the sample you're classifying. So the problem might not only be the training bias — short texts are also just fundamentally harder.
The text classification uses a convolutional neural network, so it’s able to see some context around the words. This allows negation clues to be picked up during training.
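Loosely, the effect can be illustrated like this (a toy sketch, not spaCy's actual architecture — the window width here is made up):

```python
# A convolutional layer effectively slides a fixed-width window over the
# token sequence, so a negation word and the word it modifies end up in
# the same window and can be learned as a combined cue.
tokens = "the movie was not good".split()
width = 3  # hypothetical receptive-field width

windows = [tokens[i:i + width] for i in range(len(tokens) - width + 1)]
print(windows)
```

Here "not" and "good" co-occur in a window, which is what lets training pick up that "not good" signals the opposite of "good".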
There shouldn’t be a significant difference there, no.