I'd like to build a textcat model to detect bullish or bearish sentiment of Reddit comments in popular stock subreddits. This is what I've done so far:
- Downloaded all comments that mention a stock ticker from a subreddit for the past couple of days.
- Ran the following prodigy recipe:
prodigy textcat.manual comments_data comments.jsonl --label BEARISH,BULLISH
- After about 500 labeled comments I trained a model with this command:
prodigy train textcat comments_data blank:en --eval-split 0.2
We are currently in a bull market, so most comments are bullish, and the ROC AUC for BEARISH is lower than for BULLISH.
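To put a number on the imbalance, here is a rough sketch that counts accepted labels per class. It assumes the annotations have been exported to JSONL (for example with prodigy db-out) and loaded as a list of dicts, and that each record carries either an "accept" list (multi-label choice interface) or a single "label" field plus "answer". Those field names are my assumption about the export format, and the sample records are made up for illustration.

```python
from collections import Counter


def count_labels(examples):
    """Count accepted labels in a list of annotation dicts.

    Assumed record shapes (not verified against a real export):
      {"text": ..., "accept": ["BULLISH"], "answer": "accept"}
      {"text": ..., "label": "BULLISH", "answer": "accept"}
    """
    counts = Counter()
    for eg in examples:
        # Skip rejected and ignored examples.
        if eg.get("answer") != "accept":
            continue
        if "accept" in eg:
            counts.update(eg["accept"])
        elif "label" in eg:
            counts[eg["label"]] += 1
    return counts


# Made-up records, only to show the shape of the input.
sample = [
    {"text": "comment a", "accept": ["BULLISH"], "answer": "accept"},
    {"text": "comment b", "accept": ["BULLISH"], "answer": "accept"},
    {"text": "comment c", "accept": ["BEARISH"], "answer": "accept"},
    {"text": "comment d", "accept": [], "answer": "ignore"},
]

label_counts = count_labels(sample)
print(label_counts)
```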
Given this imbalance I'm considering doing the following and was hoping to get some advice on this:
- Create two datasets, bullish_comments and bearish_comments
- Populate and label them separately with these commands:
prodigy textcat.manual bearish_comments comments.jsonl --label BEARISH
prodigy textcat.manual bullish_comments comments.jsonl --label BULLISH
- When annotating the bullish set, I would hit Ignore for neutral or bearish comments, and do the reverse when annotating the bearish set.
- Once I have an equal amount of data in both sets, I'd train the model on both.
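For the "equal amount of data" step, this is a minimal sketch of one way to get there: randomly trim the larger set down to the size of the smaller one. It is downsampling, not the annotate-until-equal process described above, and it assumes both sets are already loaded as lists of dicts. The function name and the sample records are hypothetical.

```python
import random


def downsample_to_balance(bullish, bearish, seed=0):
    """Return a shuffled list with the same number of examples per class.

    The larger class is randomly trimmed to the size of the smaller one.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n = min(len(bullish), len(bearish))
    balanced = rng.sample(bullish, n) + rng.sample(bearish, n)
    rng.shuffle(balanced)
    return balanced


# Made-up records, only to show the shape of the input.
bullish_examples = [{"text": f"bull {i}", "label": "BULLISH"} for i in range(5)]
bearish_examples = [{"text": f"bear {i}", "label": "BEARISH"} for i in range(2)]

balanced = downsample_to_balance(bullish_examples, bearish_examples)
print(len(balanced))
```

The trade-off is that trimming throws away labeled bullish comments, so it is only worth doing if the bearish set is reasonably large to begin with.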
Is the above a good way to go about this? I'm open to any advice you may have.