Text classification for stock market bullish/bearish sentiment on Reddit comments

I'd like to build a textcat model to detect bullish or bearish sentiment of Reddit comments in popular stock subreddits. This is what I've done so far:

  • Downloaded all comments that mention a stock ticker from a subreddit for the past couple of days.
  • Ran the following prodigy recipe:
    prodigy textcat.manual comments_data comments.jsonl --label BEARISH,BULLISH
  • After about 500 labeled comments I trained a model with this command:
    prodigy train textcat comments_data blank:en --eval-split 0.2

We are currently in a bull market, so most comments are bullish, and the ROC AUC value for BEARISH is lower than the one for BULLISH.

Given this imbalance I'm considering doing the following and was hoping to get some advice on this:

  • Create two data sets, bullish_comments and bearish_comments
  • Populate and label them separately with these commands:
    prodigy textcat.manual bearish_comments comments.jsonl --label BEARISH
    prodigy textcat.manual bullish_comments comments.jsonl --label BULLISH
  • When doing the bullish ones, I would hit Ignore for neutral or bearish comments. Same process when doing bearish ones.
  • Once I have an equal amount of data in both sets, I'd train the model on both.

Is the above a good way to go about this? I'm open to any advice you may have.

Hi @jrfernandez,

Your approach does sound reasonable, although I can think of other ways to do it too. For instance, you might try going back further in time to find more bearish comments. You could also collect everything into one set and do the rebalancing as a separate step. This might work well if you can get the model to help you select a more balanced sample. Specifically, active learning is often pretty good for this type of text classification task, especially if you use a terminology list to help bootstrap it.
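To make the "one set, rebalance separately" idea a bit more concrete, here is a minimal sketch that downsamples the majority class. It assumes you have already exported your annotations and flattened each one into a dict with `text` and `label` keys. That layout, the function name `downsample_to_balance`, and the toy data are assumptions for illustration, not Prodigy's own export format or API.

```python
import random
from collections import defaultdict


def downsample_to_balance(examples, label_key="label", seed=0):
    """Return a shuffled subset with the same number of examples per label.

    `examples` is assumed to be a list of dicts that each carry a single
    label under `label_key` (a hypothetical, flattened layout).
    """
    # Group the examples by their label.
    by_label = defaultdict(list)
    for eg in examples:
        by_label[eg[label_key]].append(eg)
    if not by_label:
        return []

    # The smallest class decides how many examples we keep per label.
    n_per_label = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_per_label))
    rng.shuffle(balanced)
    return balanced


# Toy data standing in for labeled comments: 8 bullish, 2 bearish.
examples = (
    [{"text": f"bullish comment {i}", "label": "BULLISH"} for i in range(8)]
    + [{"text": f"bearish comment {i}", "label": "BEARISH"} for i in range(2)]
)
balanced = downsample_to_balance(examples)
```

Downsampling throws away majority-class examples, so with only a few hundred annotations it may be better to keep everything for training and use a balanced subset only when you want an evaluation that isn't dominated by the bullish class.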

The Insults Classifier video was one of the first videos we put up, so it refers to an earlier version of Prodigy and the specific commands might be slightly different now. You might still find it helpful as an overview of the workflow, though.

@honnibal thanks! I will check out the Insults Classifier video.