Hi @deewuok,
The tutorial script doesn't work for you because you're passing the sense2vec vectors as the input file where the raw comments data is expected. `s2v_reddit_2015_md.tar.gz` (the `INPUT_DATA` in your snippet) contains the trained word vectors, not the raw comments data.
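If it helps, here is a rough way to check which kind of file you have before running the preprocessing step. This is only a sketch, not part of the tutorial: the function names are made up for illustration, and it assumes the raw comments have already been decompressed into newline-delimited JSON where each record has a `body` field.

```python
import json
import tarfile


def looks_like_vectors_archive(path):
    """Return True if `path` is a tar archive, e.g. a packaged vectors download."""
    return tarfile.is_tarfile(str(path))


def looks_like_raw_comments(path, sample_lines=5):
    """Return True if the first non-empty lines are JSON objects with a "body" key.

    Assumption: raw comments are stored as decompressed, newline-delimited JSON.
    """
    checked = 0
    try:
        with open(path, encoding="utf8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                if not isinstance(record, dict) or "body" not in record:
                    return False
                checked += 1
                if checked >= sample_lines:
                    break
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Binary files (such as a .tar.gz) or non-JSON text end up here.
        return False
    return checked > 0


def describe_input(path):
    """Give a short, human-readable guess about what `path` contains."""
    if looks_like_vectors_archive(path):
        return "tar archive (probably packaged vectors, not raw comments)"
    if looks_like_raw_comments(path):
        return "newline-delimited JSON comments"
    return "unrecognized format"
```

Calling `describe_input` on the path you set as `INPUT_DATA` should report a tar archive for the vectors download, which would confirm that it isn't the file the script expects.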
It used to be possible to download the raw data here: https://files.pushshift.io/reddit/comments/. I believe you now need to be a registered user of the Reddit API to do that.
You could also skip the preprocessing step and continue the tutorial with the Prodigy-ready data that is versioned in the tutorial's GitHub repository.