I trained a sentiment analysis model on a bunch of movie reviews using a slightly modified version of:
but I noticed that the model didn't work well on short sentences, which is no wonder, since the training data consisted of full-length articles. I was wondering if I could somehow feed Prodigy a list of negative and positive words and retrain the model so that it takes those words into account and, hopefully, works better on shorter sentences. Would it be a good or bad idea to use e.g. the
mark recipe and feed it words/phrases that are positive or negative on their own (without context), labelling them accordingly? And how does spaCy/Prodigy deal with negation?
Another thing I'm wondering about is whether there is any significant difference between adding word vectors via spaCy's
--vectors option vs. training the basic model without vectors and using Prodigy's
Is this a bad or good idea to use e.g.
the mark recipe and feed it words/phrases that are either positive or negative on their own (without context) and label them accordingly?
I think it could work, but you could also just take text you’ve labelled as positive or negative and split it into sentences, and train on those sentences.
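To make the sentence-splitting idea concrete, here's a minimal sketch. The example reviews and labels are made up, and the regex split is a deliberately naive stand-in — in practice you'd use spaCy's sentencizer (or a parser-based sentence splitter), but the shape of the resulting training examples is the same:

```python
import re

# Hypothetical labelled examples: full-length reviews with a sentiment label.
labelled_reviews = [
    ("The plot was thin. The acting saved it though.", "POSITIVE"),
    ("I expected more. The pacing dragged badly.", "NEGATIVE"),
]

def split_sentences(text):
    # Naive split on sentence-final punctuation; spaCy's sentencizer
    # is more robust for real data.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Each sentence inherits the label of the review it came from.
sentence_examples = [
    {"text": sentence, "label": label}
    for review, label in labelled_reviews
    for sentence in split_sentences(review)
]

for example in sentence_examples:
    print(example)
```

Each dict then becomes one short-text training example, so the model sees evidence at the same length it'll be asked to classify.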
The current model should actually do fairly okay at normalising for length. One of the things that makes short text hard is that there's just less evidence in the sample you're classifying. So the problem might not only be the training bias — short texts are also just fundamentally harder.
The text classification uses a convolutional neural network, so it’s able to see some context around the words. This allows negation clues to be picked up during training.
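Loosely, the effect can be illustrated like this (a toy sketch, not spaCy's actual architecture — the window width here is made up):

```python
# A convolutional layer effectively slides a fixed-width window over the
# token sequence, so a negation word and the word it modifies end up in
# the same window and can be learned as a combined cue.
tokens = "the movie was not good".split()
width = 3  # hypothetical receptive-field width

windows = [tokens[i:i + width] for i in range(len(tokens) - width + 1)]
print(windows)
```

Here "not" and "good" co-occur in a window, which is what lets training pick up that "not good" signals the opposite of "good".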
There shouldn’t be a significant difference there, no.