To begin with, it's been super fun using Prodigy and its features!
My team and I have a script running that grabs earthquake-related tweets from the streaming API.
Our goal is to multi-label these tweets with categories such as Natural_Hazards, Casualties, Impact, High_intensity, etc.
But I have one question about the sense2vec vectors.
Should I train custom sense2vec vectors on a sufficiently large corpus consisting only of earthquake-related tweets, or should I just use pre-trained general Twitter vectors, like the GloVe Twitter vectors (trained on about 2B tweets)?
And if I end up training custom sense2vec vectors, are the scripts in the sense2vec repository enough to preprocess this data, or do I need to hard-code some extra cleaning, for example removing punctuation and stopwords, stripping entities (emojis, user mentions, hashtags, etc.), and expanding contractions? Would that hard-coded preprocessing of the corpus mess up the training of the vectors?
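For reference, the kind of hard-coded cleaning I have in mind is something like the following rough sketch (the regexes and the toy stopword list are just placeholders, not what we'd ship):

```python
import re

# Toy stopword list, just to illustrate the idea
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}

def clean_tweet(text):
    """Hard-coded tweet cleaning: mentions, hashtags, URLs, punctuation, stopwords."""
    text = re.sub(r"@\w+", "", text)          # strip user mentions
    text = re.sub(r"#\w+", "", text)          # strip hashtags
    text = re.sub(r"https?://\S+", "", text)  # strip URLs
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation (and emojis)
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("Huge #earthquake near the coast! @USGS https://t.co/abc"))
# → huge near coast
```

My worry is that this kind of aggressive normalization removes exactly the context (casing, hashtags, punctuation) that the sense2vec pipeline might rely on.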