Text classification for stock market bullish/bearish sentiment on Reddit comments

jrfernandez · December 7, 2020, 7:30am

I'd like to build a textcat model to detect bullish or bearish sentiment of Reddit comments in popular stock subreddits. This is what I've done so far:

Downloaded all comments from the past couple of days that mention a stock ticker in a subreddit.
Ran the following Prodigy recipe:
prodigy textcat.manual comments_data comments.jsonl --label BEARISH,BULLISH
After about 500 labeled comments I trained a model with this command:
prodigy train textcat comments_data blank:en --eval-split 0.2

We are currently in a bull market, so most comments are bullish, and the ROC AUC for BEARISH is lower than for BULLISH.

Given this imbalance I'm considering doing the following and was hoping to get some advice on this:

Create two data sets, bullish_comments and bearish_comments
Populate and label them separately with these commands:
prodigy textcat.manual bearish_comments comments.jsonl --label BEARISH
prodigy textcat.manual bullish_comments comments.jsonl --label BULLISH
When doing the bullish ones, I would hit Ignore for neutral or bearish comments. Same process when doing bearish ones.
Once I have an equal amount of data in both sets, I'd train the model on both.
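
The balancing step in the plan above could be sketched roughly as follows. This is only an illustration under assumptions: it presumes each exported record is a dict with "text", "label" and "answer" keys, and the helper name balance_binary and the sample records are made up for the example.

```python
import random
from collections import Counter


def balance_binary(bullish, bearish, seed=0):
    """Keep accepted examples from each set and downsample to equal size.

    Assumes every record is a dict with "label" and "answer" keys
    (an assumption about the export format, not a confirmed schema).
    """
    rng = random.Random(seed)
    # Ignored or rejected examples are dropped before balancing.
    pos = [ex for ex in bullish if ex.get("answer") == "accept"]
    neg = [ex for ex in bearish if ex.get("answer") == "accept"]
    n = min(len(pos), len(neg))
    # Randomly downsample the larger class to the size of the smaller one.
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


# Hypothetical records standing in for the two annotated sets.
bullish = [
    {"text": f"bullish comment {i}", "label": "BULLISH", "answer": "accept"}
    for i in range(8)
]
bullish.append({"text": "neutral comment", "label": "BULLISH", "answer": "ignore"})
bearish = [
    {"text": f"bearish comment {i}", "label": "BEARISH", "answer": "accept"}
    for i in range(3)
]

balanced = balance_binary(bullish, bearish)
counts = Counter(ex["label"] for ex in balanced)
print(counts["BULLISH"], counts["BEARISH"])
```

Downsampling throws away some of the majority-class annotations, so it trades data volume for balance.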

Is the above a good way to go about this? I'm open to any advice you may have.

honnibal · December 15, 2020, 12:47am

Hi @jrfernandez,

Your approach sounds reasonable, although I can think of other ways to do it too. For instance, you might try going back further in time to get more bearish comments. You could also collect everything into one set and do the rebalancing separately. This might be good if you can get the model to help you select a more balanced sample. Specifically, active learning is often pretty good for this type of text classification task, especially if you use a terminology list to help bootstrap it.
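
A minimal sketch of what such a terminology list might look like as a JSONL patterns file follows. The seed terms and the helper name build_patterns are hypothetical, and splitting on whitespace is only an approximation of how the terms would actually be tokenized.

```python
import json


def build_patterns(terms_by_label):
    """Turn seed terms into JSONL lines with a label and a token pattern.

    Each term is lowercased and split on whitespace (an approximation
    of real tokenization), giving one {"lower": ...} dict per token.
    """
    lines = []
    for label, terms in terms_by_label.items():
        for term in terms:
            tokens = [{"lower": tok} for tok in term.lower().split()]
            lines.append(json.dumps({"label": label, "pattern": tokens}))
    return lines


# Hypothetical seed terms; a real list would come from reading the comments.
terms = {
    "BEARISH": ["puts", "bag holder", "overvalued"],
    "BULLISH": ["calls", "to the moon"],
}

lines = build_patterns(terms)
for line in lines:
    print(line)
```

The resulting lines could be saved to a file and supplied to a recipe such as textcat.teach through its --patterns option, so that comments matching a bearish term are surfaced earlier during annotation.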

The Insults Classifier video was one of the first videos we put up, so it refers to an earlier version of Prodigy, and the specific commands might be slightly different now. You might still find it helpful to get an overview of the workflow, though.

jrfernandez · December 18, 2020, 4:31pm

@honnibal thanks! I will check out the Insults Classifier video.
